{"id":19074356,"url":"https://github.com/howardyclo/grammar-pattern","last_synced_at":"2025-10-24T06:09:19.197Z","repository":{"id":57731886,"uuid":"136430580","full_name":"howardyclo/grammar-pattern","owner":"howardyclo","description":"Extract and align grammar patterns from English sentences.","archived":false,"fork":false,"pushed_at":"2022-12-08T02:14:41.000Z","size":131,"stargazers_count":54,"open_issues_count":7,"forks_count":10,"subscribers_count":5,"default_branch":"master","last_synced_at":"2025-04-18T19:41:08.636Z","etag":null,"topics":["chunking","grammar","grammar-parser","grammar-pattern","grammar-rules","shallow-parser"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/howardyclo.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2018-06-07T06:13:16.000Z","updated_at":"2024-10-25T05:15:27.000Z","dependencies_parsed_at":"2023-01-25T03:15:35.680Z","dependency_job_id":null,"html_url":"https://github.com/howardyclo/grammar-pattern","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/howardyclo%2Fgrammar-pattern","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/howardyclo%2Fgrammar-pattern/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/howardyclo%2Fgrammar-pattern/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/howardyclo%2Fgrammar-pattern/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/howardyclo","download_url":"https://cod
eload.github.com/howardyclo/grammar-pattern/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":251389338,"owners_count":21581779,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["chunking","grammar","grammar-parser","grammar-pattern","grammar-rules","shallow-parser"],"created_at":"2024-11-09T01:50:39.703Z","updated_at":"2025-10-24T06:09:19.114Z","avatar_url":"https://github.com/howardyclo.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# grammar-pattern\n\nThis repo offers several Python (3.x) modules for grammatical analysis:\n1. Extracting grammar patterns from sentences. For example, the grammar pattern for **\"discuss\"** in the sentence **\"He likes to discuss the issues .\"** is **\"V n\"**.\n2. Aligning grammar patterns from parallel sentences. For example, given the grammatically erroneous source sentence **\"He likes to discuss about the issues .\"** and the grammatically correct target sentence **\"He likes to discuss the issues .\"**, the aligned grammar pattern for **\"discuss\"** is **\"V about n\" → \"V n\"**.\n\nWe currently support grammar patterns for verb, noun and adjective headwords. 
See [Wikipedia](https://en.wikipedia.org/wiki/Pattern_grammar) for an introduction to grammar patterns.\n\n## Setup\nBefore using the modules, install the Python dependencies (mainly [spaCy](https://spacy.io/) and [NLTK](https://www.nltk.org/)):\n```sh\n$ pip install -r requirements.txt\n\n$ python -m spacy download en_core_web_lg \n```\n\nYou can run `test.py` to check whether any required modules or data are missing.\n```sh\n$ python test.py\n```\n\n## Example Usages\nHere we demonstrate how to test our shallow parser, extract grammar patterns from a sentence, and align grammar patterns for parallel sentences.\n\n### 0. Preprocess the sentences (See [How to use AllenNLP Constituency Tree Parser](how-to-use-allennlp-constituency-tree-parser/README.md))\nAs a pre-processing step, run an existing constituency tree parser to get a linearized constituency tree string for every sentence. The constituency tree parser we use is [AllenNLP](https://github.com/allenai/allennlp), which also has an [online demo](http://demo.allennlp.org/constituency-parsing).\n\u003cbr\u003e\u003cbr\u003e\n![Alt text](imgs/1.png)\n\n### 1. Import modules\n```python\nfrom modules.shallow_parser import shallow_parse\nfrom modules.grampat import sent_to_pats, align_parallel_pats\n```\n\n### 2. Get shallow parsed results from sentences\n```python\n# source sentence: \"He liked to discuss about the issues .\"\n# target sentence: \"He likes to discuss the issues .\"\n# Note that we parse sentences in advance using AllenNLP's constituency tree parser.\n\nsrc_parsed = shallow_parse(\"(S (NP (PRP He)) (VP (VBD liked) (S (VP (TO to) (VP (VB discuss) (PP (IN about) (NP (DT the) (NNS issues))))))) (. .))\")\ntgt_parsed = shallow_parse(\"(S (NP (PRP He)) (VP (VBZ likes) (S (VP (TO to) (VP (VB discuss) (NP (DT the) (NNS issues)))))) (. 
.))\")\n```\n```python \nprint(src_parsed)\n\n[[['He'], ['liked'], ['to'], ['discuss'], ['about'], ['the', 'issues'], ['.']],\n [['he'], ['like'], ['to'], ['discuss'], ['about'], ['the', 'issue'], ['.']],\n [['PRP'], ['VBD'], ['TO'], ['VB'], ['IN'], ['DT', 'NNS'], ['.']],\n [['H-NP'], ['H-VP'], ['H-VP'], ['H-VP'], ['H-PP'], ['I-NP', 'H-NP'], ['O']]]\n```\n```python\nprint(tgt_parsed)\n\n[[['He'], ['likes'], ['to'], ['discuss'], ['the', 'issues'], ['.']],\n [['he'], ['like'], ['to'], ['discuss'], ['the', 'issue'], ['.']],\n [['PRP'], ['VBZ'], ['TO'], ['VB'], ['DT', 'NNS'], ['.']],\n [['H-NP'], ['H-VP'], ['H-VP'], ['H-VP'], ['I-NP', 'H-NP'], ['O']]]\n```\n`shallow_parse()` returns a list of chunked elements:\n- Original words\n- Base form of original words (lemmas)\n- POS tag from constituency tree string\n- Chunk tags\n\nNote that the prefix `HIO` of chunk tags represents:\n- `H`: Headword of a chunk. This is the headword of a grammar pattern we're interested in. We simply **select the last word of a chunk as our headword**.\n- `I`: Non-headword of a chunk.\n- `O`: Outside of a chunk. This is often a punctuation word and not important in our case.\n\n### 3. 
Extract grammar patterns from sentences\n```python\nsrc_pats = sent_to_pats(src_parsed)\ntgt_pats = sent_to_pats(tgt_parsed)\n```\n```python\nprint(src_pats)\n\n[('LIKE', 'V to v', 'liked to discuss', (1, 3)),\n ('DISCUSS', 'V about n', 'discuss about the issues', (3, 5))]\n```\n```python\nprint(tgt_pats)\n\n[('LIKE', 'V to v', 'likes to discuss', (1, 3)),\n ('DISCUSS', 'V n', 'discuss the issues', (3, 4))]\n```\n`sent_to_pats()` returns a list of tuples. Each tuple contains:\n- Headword\n- Grammar pattern (the uppercase POS tag corresponds to the headword)\n- N-gram that matches the grammar pattern\n- Start and end positions of the n-gram in the chunked sentence\n\nHow `sent_to_pats()` works:\n- Generate a list of n-grams from the parsed results.\n- For every n-gram, check whether any of the **hand-selected** grammar patterns (listed in `grampat.py`) occurs in it.\n- The grammar patterns were selected in advance from [*Collins COBUILD Grammar Patterns I: Verb*](http://arts-ccr-002.bham.ac.uk/ccr/patgram/) and [*Grammar Patterns II: Nouns and Adjectives*](https://www.amazon.com/Grammar-Patterns-II-Adjectives-COBUILD/dp/0003750671), both of which were annotated by experts. We believe these patterns are generally reliable and cover most grammar patterns used in English.\n- Note that it is also possible to find good grammar patterns automatically from large monolingual corpora by counting the frequencies of POS-tag n-grams and selecting the most frequent ones. A grammar pattern can be roughly interpreted as a simplified POS-tag n-gram.\n\n### 4. 
Align grammar patterns for parallel sentences\n```python\nparallel_pats = align_parallel_pats(src_pats, tgt_pats)\n```\n```python\nprint(parallel_pats)\n\n[[('LIKE', 'V to v', 'liked to discuss', (1, 3)),\n  ('LIKE', 'V to v', 'likes to discuss', (1, 3))],\n [('DISCUSS', 'V about n', 'discuss about the issues', (3, 5)),\n  ('DISCUSS', 'V n', 'discuss the issues', (3, 4))]]\n```\n`align_parallel_pats()` returns a list of aligned grammar patterns, each pairing a source pattern with its target pattern.\n\n## What's Next?\nNow that you've completed the *Example Usages* guide, you can use these modules to count grammar patterns in large English monolingual corpora (BNC) and parallel grammatical error correction corpora (EFCAMDAT, LANG-8, CLC-FCE). We released a Python script for this (it supports multiprocessing):\n\u003cbr\u003e\u003cbr\u003e\n```sh\n$ python compute_grampat.py \\\n-in_src_path data/src.tree.txt \\\n-in_tgt_path data/tgt.tree.txt \\\n-out_path data \\\n-out_prefix dataset_name \\\n-n_jobs 4 \\\n-batch_size 1024\n```\n\nThe output file `data/dataset_name.grampat.dill` stores a Python dictionary with two keys:\n\n- `\"count_dict\"` (3-nested dict):\n    - key1: source grammar pattern (str)\n    - key2: target grammar pattern (str)\n    - key3: headword in uppercase (str)\n    - value: count\n    - Note: We also save the instances where the source grammar pattern is the same as the target grammar pattern.\n- `\"ngram_dict\"` (4-nested dict):\n    - key1: source grammar pattern (str)\n    - key2: target grammar pattern (str)\n    - key3: headword in uppercase (str)\n    - key4: (source ngram, target ngram) (tuple)\n    - value: count \n\nWe released grammar pattern results for [BNC, EFCAMDAT, LANG-8 and CLC-FCE](https://goo.gl/aKR7Hr). 
They can be used for grammatical analysis (see `query_grampat.py` for example usage).\n\n## Citation\nIf you find the repo helpful for your research, you can cite it with the following BibTeX:\n```\n@software{yi_chen_howard_lo_2020_3611412,\n  author       = {Yi-Chen Howard Lo},\n  title        = {howardyclo/grammar-pattern},\n  month        = jan,\n  year         = 2020,\n  publisher    = {Zenodo},\n  version      = {v1.0.0},\n  doi          = {10.5281/zenodo.3611412},\n  url          = {https://doi.org/10.5281/zenodo.3611412}\n}\n```\nor click this badge [![DOI](https://zenodo.org/badge/136430580.svg)](https://zenodo.org/badge/latestdoi/136430580)\nto export the citation in any format you like (on the right-hand side of the website).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhowardyclo%2Fgrammar-pattern","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fhowardyclo%2Fgrammar-pattern","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhowardyclo%2Fgrammar-pattern/lists"}