{"id":13717238,"url":"https://github.com/salesforce/matchbox","last_synced_at":"2025-05-07T07:30:39.992Z","repository":{"id":66001691,"uuid":"127103438","full_name":"salesforce/matchbox","owner":"salesforce","description":"Write PyTorch code at the level of individual examples, then run it efficiently on minibatches.","archived":true,"fork":false,"pushed_at":"2022-02-12T14:34:47.000Z","size":135,"stargazers_count":484,"open_issues_count":6,"forks_count":23,"subscribers_count":20,"default_branch":"master","last_synced_at":"2025-04-09T23:04:47.460Z","etag":null,"topics":["deep-learning","minibatch","nlp","pytorch"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/salesforce.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.Apache-2.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":"CODEOWNERS","security":null,"support":null,"governance":null}},"created_at":"2018-03-28T07:45:23.000Z","updated_at":"2025-02-15T15:13:22.000Z","dependencies_parsed_at":"2023-02-23T03:15:19.155Z","dependency_job_id":null,"html_url":"https://github.com/salesforce/matchbox","commit_stats":{"total_commits":138,"total_committers":4,"mean_commits":34.5,"dds":"0.036231884057971064","last_synced_commit":"c91f4759dd8fce705f4e94e2bef4a37bdd107379"},"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/salesforce%2Fmatchbox","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/salesforce%2Fmatchbox/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/salesforce%2Fmatchbox/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories
/salesforce%2Fmatchbox/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/salesforce","download_url":"https://codeload.github.com/salesforce/matchbox/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":252833407,"owners_count":21811177,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["deep-learning","minibatch","nlp","pytorch"],"created_at":"2024-08-03T00:01:19.699Z","updated_at":"2025-05-07T07:30:39.610Z","avatar_url":"https://github.com/salesforce.png","language":"Python","funding_links":[],"categories":["Pytorch \u0026 related libraries｜Pytorch \u0026 相关库","Pytorch \u0026 related libraries","Python"],"sub_categories":["Other libraries｜其他库:","Other libraries:"],"readme":"# Matchbox\n\nMatchbox enables deep learning researchers to write PyTorch code at the level\nof individual examples, then run it efficiently on minibatches. It does this\nusing three components:\n- A `MaskedBatch` type, together with overloaded implementations of PyTorch\nmethods and neural network layers, keeps track of padding and masking for\nvariable-size data automatically. 
Use `dir(matchbox.MaskedBatch)` to see a list\nof supported methods.\n- A `@batch` decorator rewrites some Python control flow into a\n[SIMT](https://en.wikipedia.org/wiki/Single_instruction,_multiple_threads)-like\nform that includes execution masking and synchronization primitives.\n- Convenience methods like `batch_ones`, `split_dim`, and `causal_mask` support\ncommon use cases in dynamic neural network code in a way that benefits from\nthe more semantically meaningful shape information available with the\n`MaskedBatch` type. These are implemented both for batch and tensor objects,\nbecause all code written for Matchbox also works with plain `Tensor`s at batch\nsize one.\n\nThere is also a plugin for [torchtext](https://github.com/pytorch/text) and a\nwrapper for testing that Matchbox results are numerically equivalent to a loop\nover unbatched examples. See the `examples` and `test` directories for details.\n\n## Installation and requirements\nMatchbox is in early-release alpha. Use `python setup.py install` to install.\nPlease file or upvote issues to request new operation implementations, or feel\nfree to post one as a pull request. If Matchbox throws a `NotImplementedError`,\nthat means that a particular feature of an operation could be supported but\nisn't yet.\n\nMatchbox is developed on Python 3.6 and PyTorch 0.4. It contains compatibility\ncode that is intended to support PyTorch 0.3, but not all features will work.\nMatchbox also requires `gast`, `astor`, and `six`. 
Python 2 support is not an\nimmediate priority but we would welcome a PR.\n\n## Getting started\nThe first step to using Matchbox is to replace your import of\n`torch.nn.functional` with `matchbox.functional`:\n```python\nimport matchbox\nimport matchbox.functional as F\n# now calls like `F.softmax` refer to Matchbox's implementations\n```\nThis import also replaces methods on PyTorch `Tensor`s with Matchbox versions\nand injects `matchbox.functional` functions into `torch.nn` modules.\n\nNow you can write model code that applies to individual examples. If your code\nuses control flow, add the `@matchbox.batch` decorator to that function or\nclass (unfortunately, this doesn't yet work in the interactive interpreter\nor in Jupyter notebooks):\n```python\nfrom torch import nn\nclass RNN(nn.Module):\n    def __init__(self, size):\n        super().__init__()\n        self.cell = nn.RNNCell(size, size)\n    @matchbox.batch\n    def forward(self, x):\n        h = x.new_zeros(x.size(0), x.size(-1))\n        for xt in x.unbind(1):\n            h = self.cell(xt, h)\n        return h\n```\n\nYou can create input data to pass to this model in three ways. First, you can\npass it ordinary PyTorch `Tensor`s with batch size one. 
You can also pass\n`MaskedBatch` objects created manually, from lists of `Tensor`s with batch\nsize one (note that `torch.rand` should be wrapped in `Variable` on PyTorch\n0.3):\n```python\nimport torch\nfrom matchbox import MaskedBatch\nfrom random import randint\nb, t, c = 32, 10, 128\nmodel = RNN(c)\nx_unbatched = torch.rand(1, randint(1, t), c) # a single random example\nx_manual_batch = MaskedBatch.fromlist(\n    [torch.rand(1, randint(1, t), c) for i in range(b)], # list of examples\n    (True, False)) # dimension 1 is dynamic and dimension 2 is static\nh = model(x_unbatched)\nh = model(x_manual_batch)\n```\nAnd we provide a `torchtext` Field class that produces `MaskedBatch` objects\nwhen a dataset is iterated:\n```python\nfrom torchtext import data, datasets\nfrom matchbox.data import MaskedBatchField\nTEXT = MaskedBatchField(batch_first=True)\ntrain, dev, test = datasets.IWSLT.splits(('.de', '.en'), (TEXT, TEXT))\nTEXT.build_vocab(train, max_size=50000)\ntrain_iter = data.BucketIterator(train, batch_size=32, device=-1)\nfor x_torchtext_batch in train_iter:\n    h = model(x_torchtext_batch)\n    # more training loop code\n```\n## Credit\nMatchbox is developed by James Bradbury at Salesforce Research.\nIt also contains Python source-wrangling code modified from Patrick Maupin\nand Berker Peksag's\n[AST observe-rewrite](https://github.com/berkerpeksag/astor) as well as\nGoogle Brain's [Tangent](https://github.com/google/tangent), a source-to-source\nautomatic differentiation package developed by Alex Wiltschko, Bart van\nMerrienboer and Dan Moldovan. The modified Tangent code is licensed under\nApache 2 while the rest of the codebase is licensed under three-clause BSD;\nsee `LICENSE.BSD-3.txt` and `LICENSE.Apache-2.txt`.\n\n## Limitations\nMatchbox only works on code that uses native PyTorch operators. In particular,\neverything that could vary between examples in a batch needs to be a `Tensor`\nin order for code written for individual examples to work with Matchbox. 
Support\nfor scalar tensors is significantly better in PyTorch 0.4. NumPy ops also need\nto be replaced with their native PyTorch equivalents.\n\nControl flow support is limited. While some of these limitations will be lifted\n(e.g., support for `continue` within `while` is straightforward to add) some\nconstructs are conceptually harder for Matchbox to support (e.g., `return` from\nwithin a `for`).\n\nThere’s also a long tail of less-common operations that haven’t been\nimplemented (plus bigger gaps, like convolutions). We will be continuously\nadding support for additional ops but also welcome pull requests.\n\n## Implementation details (batch semantics)\n`MaskedBatch` objects behave like PyTorch `Tensor`s, but represent a\ncollection (\"batch\") of `Tensor`s that may be of different sizes in some\nof their dimensions.\nMost of the time, `MaskedBatch` objects adhere to Matchbox's \"standard\"\nsemantics, but control flow constructions require a different \"SIMT\"\nsemantics.\n### Standard\nThe `dims` attribute is a `tuple` with a `bool` for each non-batch dimension,\nrepresenting whether that dimension is static (`False`) or dynamic (`True`).\n\nThe `data` attribute is a `Tensor` whose size is the batch size in the batch\ndimension, the size of all examples in static dimensions, and at least as large\nas the largest example in the batch in dynamic dimensions.\n\nThe `mask` attribute is a `Tensor` whose size is the batch size in the batch\ndimension, one in static dimensions, and at least as large as the largest\nexample in the batch in dynamic dimensions. 
Each entry in the mask corresponds\nto one or more entries in the data array (singleton, i.e., static, dimensions\nare broadcasted), with a one in the mask denoting that the corresponding data\nentries represent valid, meaningful data and a zero denoting that they do not.\n\nData values corresponding to zeros in the mask are not required to be zero,\nand operations should propagate masked data if doing so would not affect\nnon-masked parts of the output. Operations for which this is not the case\nshould first multiply their input data by the corresponding masks.\n### SIMT\nA one in the mask denotes that the corresponding data entries represent\ncurrently active data. A zero denotes that the corresponding data entries\nrepresent \"dormant\" data, which may be valid at a previous step of a loop\n(e.g., at a previous index along an external dimension that is being iterated\nover) or in another branch of a conditional. Currently, no dimensions in a\nSIMT batch may be dynamic, but support for this case will be added.\n\n## Future work\nIn addition to adding `MaskedBatch` support for more operations, we also plan\na separate `PackedBatch` type that can pack its data tensor along its batch\ndimension and one dynamic dimension and store a separate tensor of offsets.\nThis type will be natively compatible with cuDNN RNNs and saves memory relative\nto `MaskedBatch`, but will be slower for some operations.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsalesforce%2Fmatchbox","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsalesforce%2Fmatchbox","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsalesforce%2Fmatchbox/lists"}