{"id":33169827,"url":"https://github.com/nikitakit/self-attentive-parser","last_synced_at":"2026-01-05T18:59:57.297Z","repository":{"id":38984026,"uuid":"131264017","full_name":"nikitakit/self-attentive-parser","owner":"nikitakit","description":"High-accuracy NLP parser with models for 11 languages.","archived":false,"fork":false,"pushed_at":"2022-01-10T15:48:34.000Z","size":85224,"stargazers_count":896,"open_issues_count":48,"forks_count":161,"subscribers_count":19,"default_branch":"master","last_synced_at":"2025-10-06T01:47:09.500Z","etag":null,"topics":["ai","machine-learning","natural-language-processing","nlp","parser","parsing"],"latest_commit_sha":null,"homepage":"https://parser.kitaev.io/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/nikitakit.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2018-04-27T07:49:32.000Z","updated_at":"2025-10-05T02:44:04.000Z","dependencies_parsed_at":"2022-07-09T18:18:11.478Z","dependency_job_id":null,"html_url":"https://github.com/nikitakit/self-attentive-parser","commit_stats":null,"previous_names":[],"tags_count":3,"template":false,"template_full_name":null,"purl":"pkg:github/nikitakit/self-attentive-parser","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nikitakit%2Fself-attentive-parser","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nikitakit%2Fself-attentive-parser/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nikitakit%2Fself-attentive-parser/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nikitakit%2Fself-attentive-parser/manifest
s","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/nikitakit","download_url":"https://codeload.github.com/nikitakit/self-attentive-parser/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nikitakit%2Fself-attentive-parser/sbom","scorecard":{"id":687307,"data":{"date":"2025-08-18","repo":{"name":"github.com/nikitakit/self-attentive-parser","commit":"24435a156a64d7433829e9a500a81f46b7e58030"},"scorecard":{"version":"v5.2.1-41-g40576783","commit":"40576783fda6698350fcbbeaea760ff827433034"},"score":2.7,"checks":[{"name":"Token-Permissions","score":-1,"reason":"No tokens found","details":null,"documentation":{"short":"Determines if the project's workflows follow the principle of least privilege.","url":"https://github.com/ossf/scorecard/blob/40576783fda6698350fcbbeaea760ff827433034/docs/checks.md#token-permissions"}},{"name":"Maintained","score":0,"reason":"0 commit(s) and 0 issue activity found in the last 90 days -- score normalized to 0","details":null,"documentation":{"short":"Determines if the project is \"actively maintained\".","url":"https://github.com/ossf/scorecard/blob/40576783fda6698350fcbbeaea760ff827433034/docs/checks.md#maintained"}},{"name":"Packaging","score":-1,"reason":"packaging workflow not detected","details":["Warn: no GitHub/GitLab publishing workflow detected."],"documentation":{"short":"Determines if the project is published as a package that others can easily download, install, easily update, and uninstall.","url":"https://github.com/ossf/scorecard/blob/40576783fda6698350fcbbeaea760ff827433034/docs/checks.md#packaging"}},{"name":"Code-Review","score":0,"reason":"Found 0/30 approved changesets -- score normalized to 0","details":null,"documentation":{"short":"Determines if the project requires human code review before pull requests (aka merge requests) are 
merged.","url":"https://github.com/ossf/scorecard/blob/40576783fda6698350fcbbeaea760ff827433034/docs/checks.md#code-review"}},{"name":"SAST","score":0,"reason":"no SAST tool detected","details":["Warn: no pull requests merged into dev branch"],"documentation":{"short":"Determines if the project uses static code analysis.","url":"https://github.com/ossf/scorecard/blob/40576783fda6698350fcbbeaea760ff827433034/docs/checks.md#sast"}},{"name":"Dangerous-Workflow","score":-1,"reason":"no workflows found","details":null,"documentation":{"short":"Determines if the project's GitHub Action workflows avoid dangerous patterns.","url":"https://github.com/ossf/scorecard/blob/40576783fda6698350fcbbeaea760ff827433034/docs/checks.md#dangerous-workflow"}},{"name":"Binary-Artifacts","score":10,"reason":"no binaries found in the repo","details":null,"documentation":{"short":"Determines if the project has generated executable (binary) artifacts in the source repository.","url":"https://github.com/ossf/scorecard/blob/40576783fda6698350fcbbeaea760ff827433034/docs/checks.md#binary-artifacts"}},{"name":"CII-Best-Practices","score":0,"reason":"no effort to earn an OpenSSF best practices badge detected","details":null,"documentation":{"short":"Determines if the project has an OpenSSF (formerly CII) Best Practices Badge.","url":"https://github.com/ossf/scorecard/blob/40576783fda6698350fcbbeaea760ff827433034/docs/checks.md#cii-best-practices"}},{"name":"Pinned-Dependencies","score":-1,"reason":"no dependencies found","details":null,"documentation":{"short":"Determines if the project has declared and pinned the dependencies of its build process.","url":"https://github.com/ossf/scorecard/blob/40576783fda6698350fcbbeaea760ff827433034/docs/checks.md#pinned-dependencies"}},{"name":"Security-Policy","score":0,"reason":"security policy file not detected","details":["Warn: no security policy file detected","Warn: no security file to analyze","Warn: no security file to analyze","Warn: no security file 
to analyze"],"documentation":{"short":"Determines if the project has published a security policy.","url":"https://github.com/ossf/scorecard/blob/40576783fda6698350fcbbeaea760ff827433034/docs/checks.md#security-policy"}},{"name":"Vulnerabilities","score":10,"reason":"0 existing vulnerabilities detected","details":null,"documentation":{"short":"Determines if the project has open, known unfixed vulnerabilities.","url":"https://github.com/ossf/scorecard/blob/40576783fda6698350fcbbeaea760ff827433034/docs/checks.md#vulnerabilities"}},{"name":"License","score":10,"reason":"license file detected","details":["Info: project has a license file: LICENSE:0","Info: FSF or OSI recognized license: MIT License: LICENSE:0"],"documentation":{"short":"Determines if the project has defined a license.","url":"https://github.com/ossf/scorecard/blob/40576783fda6698350fcbbeaea760ff827433034/docs/checks.md#license"}},{"name":"Fuzzing","score":0,"reason":"project is not fuzzed","details":["Warn: no fuzzer integrations found"],"documentation":{"short":"Determines if the project uses fuzzing.","url":"https://github.com/ossf/scorecard/blob/40576783fda6698350fcbbeaea760ff827433034/docs/checks.md#fuzzing"}},{"name":"Signed-Releases","score":0,"reason":"Project has not signed or included provenance with any releases.","details":["Warn: release artifact models not signed: https://api.github.com/repos/nikitakit/self-attentive-parser/releases/10749795","Warn: release artifact models does not have provenance: https://api.github.com/repos/nikitakit/self-attentive-parser/releases/10749795"],"documentation":{"short":"Determines if the project cryptographically signs release artifacts.","url":"https://github.com/ossf/scorecard/blob/40576783fda6698350fcbbeaea760ff827433034/docs/checks.md#signed-releases"}},{"name":"Branch-Protection","score":0,"reason":"branch protection not enabled on development/release branches","details":["Warn: branch protection not enabled for branch 
'master'"],"documentation":{"short":"Determines if the default and release branches are protected with GitHub's branch protection settings.","url":"https://github.com/ossf/scorecard/blob/40576783fda6698350fcbbeaea760ff827433034/docs/checks.md#branch-protection"}}]},"last_synced_at":"2025-08-22T01:15:24.681Z","repository_id":38984026,"created_at":"2025-08-22T01:15:24.681Z","updated_at":"2025-08-22T01:15:24.681Z"},"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":285494607,"owners_count":27181443,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-11-20T02:00:05.334Z","response_time":54,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai","machine-learning","natural-language-processing","nlp","parser","parsing"],"created_at":"2025-11-16T01:00:35.513Z","updated_at":"2025-11-20T19:00:34.796Z","avatar_url":"https://github.com/nikitakit.png","language":"Python","funding_links":[],"categories":["Python","Tools"],"sub_categories":["Syntactic parsers"],"readme":"# Berkeley Neural Parser\n\nA high-accuracy parser with models for 11 languages, implemented in Python. 
Based on [Constituency Parsing with a Self-Attentive Encoder](https://arxiv.org/abs/1805.01052) from ACL 2018, with additional changes described in [Multilingual Constituency Parsing with Self-Attention and Pre-Training](https://arxiv.org/abs/1812.11760).\n\n**New February 2021:** Version 0.2.0 of the Berkeley Neural Parser is now out, with higher-quality pre-trained models for all languages. Inference now uses PyTorch instead of TensorFlow (training has always been PyTorch-only). Drops support for Python 2.7 and 3.5. Includes updated support for training and using your own parsers, based on your choice of [pre-trained model](https://huggingface.co/models).\n\n## Contents\n1. [Installation](#installation)\n2. [Usage](#usage)\n3. [Available Models](#available-models)\n4. [Training](#training)\n5. [Reproducing Experiments](#reproducing-experiments)\n6. [Citation](#citation)\n7. [Credits](#credits)\n\nIf you are primarily interested in training your own parsing models, skip to the [Training](#training) section of this README.\n\n## Installation\n\nTo install the parser, run the command:\n```bash\n$ pip install benepar\n```\n*Note: benepar 0.2.0 is a major upgrade over the previous version, and comes with entirely new and higher-quality parser models. If you are not ready to upgrade, you can pin your benepar version to [the previous release (0.1.3)](https://github.com/nikitakit/self-attentive-parser/tree/acl2019).*\n\nPython 3.6 (or newer) and [PyTorch](https://pytorch.org/) 1.6 (or newer) are required. See the PyTorch website for instructions on how to select between GPU-enabled and CPU-only versions of PyTorch; benepar will automatically use the GPU if it is available to PyTorch.\n\nThe recommended way of using benepar is through integration with [spaCy](https://spacy.io/). If using spaCy, you should install a spaCy model for your language. 
For English, the installation command is:\n```sh\n$ python -m spacy download en_core_web_md\n```\n\nThe spaCy model is only used for tokenization and sentence segmentation. If language-specific analysis beyond parsing is not required, you may also forego a language-specific model and instead use a multi-language model that only performs tokenization and segmentation. [One such model](https://spacy.io/models/xx#xx_sent_ud_sm), newly added in spaCy 3.0, should work for English, German, Korean, Polish, and Swedish (but not Chinese, since it doesn't seem to support Chinese word segmentation).\n\nParsing models need to be downloaded separately, using the commands:\n```python\n\u003e\u003e\u003e import benepar\n\u003e\u003e\u003e benepar.download('benepar_en3')\n```\n\nSee the [Available Models](#available-models) section below for a full list of models.\n\n## Usage\n\n### Usage with spaCy (recommended)\n\nThe recommended way of using benepar is through its integration with spaCy:\n```python\n\u003e\u003e\u003e import benepar, spacy\n\u003e\u003e\u003e nlp = spacy.load('en_core_web_md')\n\u003e\u003e\u003e if spacy.__version__.startswith('2'):\n        nlp.add_pipe(benepar.BeneparComponent(\"benepar_en3\"))\n    else:\n        nlp.add_pipe(\"benepar\", config={\"model\": \"benepar_en3\"})\n\u003e\u003e\u003e doc = nlp(\"The time for action is now. It's never too late to do something.\")\n\u003e\u003e\u003e sent = list(doc.sents)[0]\n\u003e\u003e\u003e print(sent._.parse_string)\n(S (NP (NP (DT The) (NN time)) (PP (IN for) (NP (NN action)))) (VP (VBZ is) (ADVP (RB now))) (. .))\n\u003e\u003e\u003e sent._.labels\n('S',)\n\u003e\u003e\u003e list(sent._.children)[0]\nThe time for action\n```\n\nSince spaCy does not provide an official constituency parsing API, all methods are accessible through the extension namespaces `Span._` and `Token._`.\n\nThe following extension properties are available:\n- `Span._.labels`: a tuple of labels for the given span. 
A span may have multiple labels when there are unary chains in the parse tree.\n- `Span._.parse_string`: a string representation of the parse tree for a given span.\n- `Span._.constituents`: an iterator over `Span` objects for sub-constituents in a pre-order traversal of the parse tree.\n- `Span._.parent`: the parent `Span` in the parse tree.\n- `Span._.children`: an iterator over child `Span`s in the parse tree.\n- `Token._.labels`, `Token._.parse_string`, `Token._.parent`: these behave the same as calling the corresponding method on the length-one Span containing the token.\n\nThese methods will raise an exception when called on a span that is not a constituent in the parse tree. Such errors can be avoided by traversing the parse tree starting at either sentence level (by iterating over `doc.sents`) or with an individual `Token` object.\n\n### Usage with NLTK\n\nThere is also an NLTK interface, which is designed for use with pre-tokenized datasets and treebanks, or when integrating the parser into an NLP pipeline that already performs (at minimum) tokenization and sentence splitting. For parsing starting with raw text, it is **strongly encouraged** that you use spaCy and `benepar.BeneparComponent` instead.\n\nSample usage with NLTK:\n```python\n\u003e\u003e\u003e import benepar\n\u003e\u003e\u003e parser = benepar.Parser(\"benepar_en3\")\n\u003e\u003e\u003e input_sentence = benepar.InputSentence(\n    words=['\"', 'Fly', 'safely', '.', '\"'],\n    space_after=[False, True, False, False, False],\n    tags=['``', 'VB', 'RB', '.', \"''\"],\n    escaped_words=['``', 'Fly', 'safely', '.', \"''\"],\n)\n\u003e\u003e\u003e tree = parser.parse(input_sentence)\n\u003e\u003e\u003e print(tree)\n(TOP (S (`` ``) (VP (VB Fly) (ADVP (RB safely))) (. .) ('' '')))\n```\n\nNot all fields of `benepar.InputSentence` are required, but at least one of `words` and `escaped_words` must be specified. 
The parser will attempt to guess the value for missing fields, for example:\n```python\n\u003e\u003e\u003e input_sentence = benepar.InputSentence(\n    words=['\"', 'Fly', 'safely', '.', '\"'],\n)\n\u003e\u003e\u003e parser.parse(input_sentence)\n```\n\nUse `parse_sents` to parse multiple sentences.\n```python\n\u003e\u003e\u003e input_sentence1 = benepar.InputSentence(\n    words=['The', 'time', 'for', 'action', 'is', 'now', '.'],\n)\n\u003e\u003e\u003e input_sentence2 = benepar.InputSentence(\n    words=['It', \"'s\", 'never', 'too', 'late', 'to', 'do', 'something', '.'],\n)\n\u003e\u003e\u003e parser.parse_sents([input_sentence1, input_sentence2])\n```\n\nSome parser models also allow Unicode text input for debugging/interactive use, but passing in raw text strings is *strongly discouraged* for any application where parsing accuracy matters.\n```python\n\u003e\u003e\u003e parser.parse('\"Fly safely.\"')  # For debugging/interactive use only.\n```\nWhen parsing from raw text, we recommend using spaCy and `benepar.BeneparComponent` instead. The reason is that parser models do not ship with a tokenizer or sentence splitter, and some models may not include a part-of-speech tagger either. A toolkit must be used to fill in these pipeline components, and spaCy outperforms NLTK in all of these areas (sometimes by a large margin). \n\n\n\n## Available Models\n\nThe following trained parser models are available. To use spaCy integration, you will also need to install a [spaCy model for the appropriate language](https://spacy.io/models).\n\nModel       | Language | Info\n----------- | -------- | ----\n`benepar_en3` | English | 95.40 F1 on [revised](https://catalog.ldc.upenn.edu/LDC2015T13) WSJ test set. The training data uses revised tokenization and syntactic annotation based on the same guidelines as the English Web Treebank and OntoNotes, which better matches modern tokenization practices in libraries like spaCy. 
Based on T5-small.\n`benepar_en3_large` | English | 96.29 F1 on [revised](https://catalog.ldc.upenn.edu/LDC2015T13) WSJ test set. The training data uses revised tokenization and syntactic annotation based on the same guidelines as the English Web Treebank and OntoNotes, which better matches modern tokenization practices in libraries like spaCy. Based on T5-large.\n`benepar_zh2` | Chinese | 92.56 F1 on CTB 5.1 test set. Usage with spaCy supports parsing from raw text, but the NLTK API only supports parsing previously tokenized sentences. Based on Chinese ELECTRA-180G-large.\n`benepar_ar2` | Arabic | 90.52 F1 on SPMRL2013/2014 test set. Only supports using the NLTK API for parsing previously tokenized sentences. Parsing from raw text and spaCy integration are not supported. Based on XLM-R.\n`benepar_de2` | German | 92.10 F1 on SPMRL2013/2014 test set. Based on XLM-R.\n`benepar_eu2` | Basque | 93.36 F1 on SPMRL2013/2014 test set. Usage with spaCy first requires implementing Basque support in spaCy. Based on XLM-R.\n`benepar_fr2` | French | 88.43 F1 on SPMRL2013/2014 test set. Based on XLM-R.\n`benepar_he2` | Hebrew | 93.98 F1 on SPMRL2013/2014 test set. Only supports using the NLTK API for parsing previously tokenized sentences. Parsing from raw text and spaCy integration are not supported. Based on XLM-R.\n`benepar_hu2` | Hungarian | 96.19 F1 on SPMRL2013/2014 test set. Usage with spaCy requires a [Hungarian model for spaCy](https://github.com/oroszgy/spacy-hungarian-models). The NLTK API only supports parsing previously tokenized sentences. Based on XLM-R.\n`benepar_ko2` | Korean | 91.72 F1 on SPMRL2013/2014 test set. Can be used with spaCy's [multi-language sentence segmentation model](https://spacy.io/models/xx#xx_sent_ud_sm) (requires spaCy v3.0). The NLTK API only supports parsing previously tokenized sentences. Based on XLM-R.\n`benepar_pl2` | Polish | 97.15 F1 on SPMRL2013/2014 test set. 
Based on XLM-R.\n`benepar_sv2` | Swedish | 92.21 F1 on SPMRL2013/2014 test set. Can be used with spaCy's [multi-language sentence segmentation model](https://spacy.io/models/xx#xx_sent_ud_sm) (requires spaCy v3.0). Based on XLM-R.\n`benepar_en3_wsj` | English | **Consider using `benepar_en3` or `benepar_en3_large` instead**. 95.55 F1 on [canonical](https://catalog.ldc.upenn.edu/LDC99T42) WSJ test set used for decades of English constituency parsing publications. Based on BERT-large-uncased. We believe that the revised annotation guidelines used for training `benepar_en3`/`benepar_en3_large` are more suitable for downstream use because they better handle language usage in web text, and are more consistent with modern practices in dependency parsing and libraries like spaCy. Nevertheless, we provide the `benepar_en3_wsj` model for cases where using the revised treebanking conventions is not appropriate, such as benchmarking different models on the same dataset.\n\n## Training\n\nTraining requires cloning this repository from GitHub. While the model code in `src/benepar` is distributed in the `benepar` package on PyPI, the training and evaluation scripts directly under `src/` are not.\n\n#### Software Requirements for Training\n* Python 3.7 or higher.\n* [PyTorch](http://pytorch.org/) 1.6.0, or any compatible version.\n* All dependencies required by the `benepar` package, including: [NLTK](https://www.nltk.org/) 3.2, [torch-struct](https://github.com/harvardnlp/pytorch-struct) 0.4, [transformers](https://github.com/huggingface/transformers) 4.3.0, or compatible.\n* [pytokenizations](https://github.com/tamuhey/tokenizations/) 0.7.2 or compatible.\n* [EVALB](http://nlp.cs.nyu.edu/evalb/). Before starting, run `make` inside the `EVALB/` directory to compile an `evalb` executable. This will be called from Python for evaluation. 
If training on the SPMRL datasets, you will need to run `make` inside the `EVALB_SPMRL/` directory instead.\n\n### Training Instructions\n\nA new model can be trained using the command `python src/main.py train ...`. Some of the available arguments are:\n\nArgument | Description | Default\n--- | --- | ---\n`--model-path-base` | Path base to use for saving models | N/A\n`--evalb-dir` |  Path to EVALB directory | `EVALB/`\n`--train-path` | Path to training trees | `data/wsj/train_02-21.LDC99T42`\n`--train-path-text` | Optional non-destructive tokenization of the training data | Guess raw text; see `--text-processing`\n`--dev-path` | Path to development trees | `data/wsj/dev_22.LDC99T42`\n`--dev-path-text` | Optional non-destructive tokenization of the development data | Guess raw text; see `--text-processing`\n`--text-processing` | Heuristics for guessing raw text from destructively tokenized tree files. See `load_trees()` in `src/treebanks.py` | Default rules for languages other than Arabic, Chinese, and Hebrew\n`--subbatch-max-tokens` | Maximum number of tokens to process in parallel while training (a full batch may not fit in GPU memory) | 2000\n`--parallelize` | Distribute pre-trained model (e.g. T5) layers across multiple GPUs. | Use at most one GPU\n`--batch-size` | Number of examples per training update | 32\n`--checks-per-epoch` | Number of development evaluations per epoch | 4\n`--numpy-seed` | NumPy random seed | Random\n`--use-pretrained` | Use pre-trained encoder | Do not use pre-trained encoder\n`--pretrained-model` | Model to use if `--use-pretrained` is passed. 
May be a path or a model id from the [HuggingFace Model Hub](https://huggingface.co/models)| `bert-base-uncased`\n`--predict-tags` | Adds a part-of-speech tagging component and auxiliary loss to the parser | Do not predict tags\n`--use-chars-lstm` | Use learned CharLSTM word representations | Do not use CharLSTM\n`--use-encoder` | Use learned transformer layers on top of pre-trained model or CharLSTM | Do not use extra transformer layers\n`--num-layers` | Number of transformer layers to use if `--use-encoder` is passed | 8\n`--encoder-max-len` | Maximum sentence length (in words) allowed for extra transformer layers | 512\n\nAdditional arguments are available for other hyperparameters; see `make_hparams()` in `src/main.py`. These can be specified on the command line, such as `--num-layers 2` (for numerical parameters), `--predict-tags` (for boolean parameters that default to False), or `--no-XXX` (for boolean parameters that default to True).\n\nFor each development evaluation, the F-score on the development set is computed and compared to the previous best. If the current model is better, the previous model will be deleted and the current model will be saved. The new filename will be derived from the provided model path base and the development F-score.\n\nPrior to training the parser, you will first need to obtain appropriate training data. We provide [instructions on how to process standard datasets like PTB, CTB, and the SPMRL 2013/2014 Shared Task data](data/README.md). 
After following the instructions for the English WSJ data, you can use the following command to train an English parser using the default hyperparameters:\n\n```\npython src/main.py train --use-pretrained --model-path-base models/en_bert_base\n```\n\nSee [`EXPERIMENTS.md`](EXPERIMENTS.md) for more examples of good hyperparameter choices.\n\n### Evaluation Instructions\n\nA saved model can be evaluated on a test corpus using the command `python src/main.py test ...` with the following arguments:\n\nArgument | Description | Default\n--- | --- | ---\n`--model-path` | Path of saved model | N/A\n`--evalb-dir` |  Path to EVALB directory | `EVALB/`\n`--test-path` | Path to test trees | `data/23.auto.clean`\n`--test-path-text` | Optional non-destructive tokenization of the test data | Guess raw text; see `--text-processing`\n`--text-processing` | Heuristics for guessing raw text from destructively tokenized tree files. See `load_trees()` in `src/treebanks.py` | Default rules for languages other than Arabic, Chinese, and Hebrew\n`--test-path-raw` | Alternative path to test trees that is used for evalb only (used to double-check that evaluation against pre-processed trees does not contain any bugs) | Compare to trees from `--test-path`\n`--subbatch-max-tokens` | Maximum number of tokens to process in parallel (a GPU does not have enough memory to process the full dataset in one batch) | 500\n`--parallelize` | Distribute pre-trained model (e.g. T5) layers across multiple GPUs. | Use at most one GPU\n`--output-path` | Path to write predicted trees to (use `\"-\"` for stdout). | Do not save predicted trees\n`--no-predict-tags` | Use gold part-of-speech tags when running EVALB. This is the standard for publications, and omitting this flag may give erroneously high F1 scores. 
| Use predicted part-of-speech tags for EVALB, if available\n\nAs an example, you can evaluate a trained model using the following command:\n```\npython src/main.py test --model-path models/en_bert_base_dev=*.pt\n```\n\n### Exporting Models for Inference\n\nThe `benepar` package can directly use saved checkpoints by replacing a model name like `benepar_en3` with a path such as `models/en_bert_base_dev_dev=95.67.pt`. However, releasing the single-file checkpoints has a few shortcomings:\n* Single-file checkpoints do not include the tokenizer or pre-trained model config. These can generally be downloaded automatically from the HuggingFace model hub, but this requires an Internet connection and may also (incidentally and unnecessarily) download pre-trained weights from the HuggingFace Model Hub\n* Single-file checkpoints are 3x larger than necessary, because they save optimizer state\n\nUse `src/export.py` to convert a checkpoint file into a directory that encapsulates everything about a trained model. For example,\n```\npython src/export.py export \\\n  --model-path models/en_bert_base_dev=*.pt \\\n  --output-dir=models/en_bert_base\n```\n\nWhen exporting, there is also a `--compress` option that slightly adjusts model weights, so that the output directory can be compressed into a ZIP archive of much smaller size. We use this for our official model releases, because it's a hassle to distribute model weights that are 2GB+ in size. When using the `--compress` option, it is recommended to specify a test set in order to verify that compression indeed has minimal impact on parsing accuracy. 
Using the development data for verification is not recommended, since the development data was already used for the model selection criterion during training.\n```\npython src/export.py export \\\n  --model-path models/en_bert_base_dev=*.pt \\\n  --output-dir=models/en_bert_base \\\n  --test-path=data/wsj/test_23.LDC99T42\n```\n\nThe `src/export.py` script also has a `test` subcommand that's roughly similar to `python src/main.py test`, except that it supports exported models and has slightly different flags. We can run the following command to verify that our English parser using BERT-large-uncased indeed achieves 95.55 F1 on the canonical WSJ test set:\n```\npython src/export.py test --model-path benepar_en3_wsj --test-path data/wsj/test_23.LDC99T42\n```\n\n## Reproducing Experiments\n\nSee [`EXPERIMENTS.md`](EXPERIMENTS.md) for instructions on how to reproduce experiments reported in our ACL 2018 and 2019 papers.\n\n## Citation\n\nIf you use this software for research, please cite our papers as follows:\n\n```\n@inproceedings{kitaev-etal-2019-multilingual,\n    title = \"Multilingual Constituency Parsing with Self-Attention and Pre-Training\",\n    author = \"Kitaev, Nikita  and\n      Cao, Steven  and\n      Klein, Dan\",\n    booktitle = \"Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics\",\n    month = jul,\n    year = \"2019\",\n    address = \"Florence, Italy\",\n    publisher = \"Association for Computational Linguistics\",\n    url = \"https://www.aclweb.org/anthology/P19-1340\",\n    doi = \"10.18653/v1/P19-1340\",\n    pages = \"3499--3505\",\n}\n\n@inproceedings{kitaev-klein-2018-constituency,\n    title = \"Constituency Parsing with a Self-Attentive Encoder\",\n    author = \"Kitaev, Nikita  and\n      Klein, Dan\",\n    booktitle = \"Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)\",\n    month = jul,\n    year = \"2018\",\n    address = 
\"Melbourne, Australia\",\n    publisher = \"Association for Computational Linguistics\",\n    url = \"https://www.aclweb.org/anthology/P18-1249\",\n    doi = \"10.18653/v1/P18-1249\",\n    pages = \"2676--2686\",\n}\n```\n\n## Credits\n\nThe code in this repository and portions of this README are based on https://github.com/mitchellstern/minimal-span-parser\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnikitakit%2Fself-attentive-parser","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fnikitakit%2Fself-attentive-parser","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnikitakit%2Fself-attentive-parser/lists"}