# grammar-pattern
This repo offers several Python (3.x) modules for grammatical analysis:
1. Extracting grammar patterns from sentences. For example, the grammar pattern for **"discuss"** in the sentence **"He likes to discuss the issues ."** would be **"V n"**.
2. Aligning grammar patterns from parallel sentences. For example, given the grammatically erroneous source sentence **"He likes to discuss about the issues ."** and the grammatically correct target sentence **"He likes to discuss the issues ."**, the aligned grammar pattern for **"discuss"** would be **"V about n" → "V n"**.

We currently support grammar patterns for verb, noun, and adjective headwords. See the [Wikipedia article on pattern grammar](https://en.wikipedia.org/wiki/Pattern_grammar).
## Setup
Before using the modules, please install the Python dependencies (mainly [spaCy](https://spacy.io/) and [NLTK](https://www.nltk.org/)):
```sh
$ pip install -r requirements.txt
$ python -m spacy download en_core_web_lg
```
You can simply run `test.py` to check whether any required modules or data are missing:
```sh
$ python test.py
```
## Example Usages
Here we demonstrate how to test the shallow parser, extract grammar patterns from a sentence, and align grammar patterns across parallel sentences.
### 0. Preprocess the sentences (See [How to use AllenNLP Constituency Tree Parser](how-to-use-allennlp-constituency-tree-parser/README.md))
As a pre-processing step, run an existing constituency tree parser to get a linearized constituency tree string for every sentence. The constituency tree parser we use is [AllenNLP](https://github.com/allenai/allennlp). They also have an [online demo](http://demo.allennlp.org/constituency-parsing).
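For reference, producing those tree strings programmatically might look like the following minimal sketch. The model archive URL is an assumption (AllenNLP's hosted model locations have changed over time), so check their documentation for the current one:
```python
# Minimal sketch (not part of this repo): get linearized constituency
# tree strings from AllenNLP's pretrained parser.
from allennlp.predictors.predictor import Predictor

# Assumed archive location; verify against AllenNLP's documentation.
MODEL_URL = ("https://storage.googleapis.com/allennlp-public-models/"
             "elmo-constituency-parser-2020.02.10.tar.gz")

predictor = Predictor.from_path(MODEL_URL)

for sent in ["He liked to discuss about the issues .",
             "He likes to discuss the issues ."]:
    result = predictor.predict(sentence=sent)
    print(result["trees"])  # e.g. "(S (NP (PRP He)) (VP ...) (. .))"
```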
![Alt text](imgs/1.png)
### 1. Import modules
```python
from modules.shallow_parser import shallow_parse
from modules.grampat import sent_to_pats, align_parallel_pats
```
### 2. Get shallow parsed results from sentences
```python
# source sentence: "He liked to discuss about the issues ."
# target sentence: "He likes to discuss the issues ."
# Note that we parse sentences in advance using AllenNLP's constituency tree parser.
src_parsed = shallow_parse("(S (NP (PRP He)) (VP (VBD liked) (S (VP (TO to) (VP (VB discuss) (PP (IN about) (NP (DT the) (NNS issues))))))) (. .))")
tgt_parsed = shallow_parse("(S (NP (PRP He)) (VP (VBZ likes) (S (VP (TO to) (VP (VB discuss) (NP (DT the) (NNS issues)))))) (. .))")
```
```python
print(src_parsed)
[[['He'], ['liked'], ['to'], ['discuss'], ['about'], ['the', 'issues'], ['.']],
[['he'], ['like'], ['to'], ['discuss'], ['about'], ['the', 'issue'], ['.']],
[['PRP'], ['VBD'], ['TO'], ['VB'], ['IN'], ['DT', 'NNS'], ['.']],
[['H-NP'], ['H-VP'], ['H-VP'], ['H-VP'], ['H-PP'], ['I-NP', 'H-NP'], ['O']]]
```
```python
print(tgt_parsed)
[[['He'], ['likes'], ['to'], ['discuss'], ['the', 'issues'], ['.']],
[['he'], ['like'], ['to'], ['discuss'], ['the', 'issue'], ['.']],
[['PRP'], ['VBZ'], ['TO'], ['VB'], ['DT', 'NNS'], ['.']],
[['H-NP'], ['H-VP'], ['H-VP'], ['H-VP'], ['I-NP', 'H-NP'], ['O']]]
```
`shallow_parse()` returns four chunk-aligned lists:
- Original words
- Base form of original words (lemmas)
- POS tag from constituency tree string
- Chunk tags

Note that the prefixes `H`, `I`, and `O` of chunk tags represent:
- `H`: Headword of a chunk. This is the headword of the grammar pattern we're interested in. We simply **select the last word of a chunk as our headword** (see the sketch after this list).
- `I`: Non-headword of a chunk.
- `O`: Outside of a chunk. This is often punctuation and not important in our case.
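To make the headword selection concrete, here is a minimal sketch (illustrative only, not part of the repo's API) that reads headwords off the `shallow_parse()` output above, treating every word whose chunk tag starts with `H-` as a headword:
```python
# Minimal, illustrative sketch: collect headwords from shallow_parse()
# output, i.e. every word whose chunk tag carries the "H-" prefix.
def extract_headwords(parsed):
    words, lemmas, pos_tags, chunk_tags = parsed
    return [word
            for chunk_words, chunk_word_tags in zip(words, chunk_tags)
            for word, tag in zip(chunk_words, chunk_word_tags)
            if tag.startswith("H-")]

print(extract_headwords(src_parsed))
# ['He', 'liked', 'to', 'discuss', 'about', 'issues']
```
### 3. Extract grammar patterns from sentences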
```python
src_pats = sent_to_pats(src_parsed)
tgt_pats = sent_to_pats(tgt_parsed)
```
```python
print(src_pats)
[('LIKE', 'V to v', 'liked to discuss', (1, 3)),
('DISCUSS', 'V about n', 'discuss about the issues', (3, 5))]
```
```python
print(tgt_pats)
[('LIKE', 'V to v', 'likes to discuss', (1, 3)),
('DISCUSS', 'V n', 'discuss the issues', (3, 4))]
```
`sent_to_pats()` returns a list of tuples; each tuple contains:
- Headword
- Grammar pattern (the POS tag in uppercase corresponds to the headword)
- N-gram that matches the grammar pattern
- Start and end positions of the n-gram in the chunked sentence

How `sent_to_pats()` works:
- Generate a list of n-grams from the parsed results.
- For every n-gram, check whether any **hand-selected** grammar pattern (listed in `grampat.py`) matches it; a toy sketch of this lookup follows the list.
- The grammar patterns were selected in advance from [*Collins COBUILD Grammar Patterns I: Verb*](http://arts-ccr-002.bham.ac.uk/ccr/patgram/) and [*Grammar Patterns II: Nouns and Adjectives*](https://www.amazon.com/Grammar-Patterns-II-Adjectives-COBUILD/dp/0003750671), which were annotated by experts. We believe these patterns are of high quality and cover most grammar patterns used in English.
- Note that it is also possible to find good grammar patterns automatically from large monolingual corpora by counting the frequencies of POS-tag n-grams and selecting the frequent ones; a grammar pattern can roughly be interpreted as a simplified POS-tag n-gram.
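Here is that toy sketch. The tiny pattern table and simplified tokens below are illustrative stand-ins for the real, far larger tables in `grampat.py`, and the span convention here (end-exclusive) may differ from the repo's:
```python
# Toy sketch of the n-gram lookup; HAND_SELECTED is an illustrative
# stand-in for the hand-selected tables in grampat.py. Uppercase "V"
# marks the headword currently in focus, so the simplified sentence is
# rebuilt once per candidate headword.
HAND_SELECTED = {
    ("V", "to", "v"): "V to v",
    ("V", "about", "n"): "V about n",
    ("V", "n"): "V n",
}

def find_patterns_for_head(tokens, head_idx, max_n=4):
    toks = [t.upper() if i == head_idx else t for i, t in enumerate(tokens)]
    hits = []
    for n in range(2, max_n + 1):           # try n-gram sizes 2..max_n
        for i in range(len(toks) - n + 1):
            ngram = tuple(toks[i:i + n])
            if ngram in HAND_SELECTED:
                hits.append((HAND_SELECTED[ngram], (i, i + n)))
    return hits

# Simplified chunk tokens for "He liked to discuss about the issues ."
tokens = ["n", "v", "to", "v", "about", "n"]
print(find_patterns_for_head(tokens, head_idx=1))  # [('V to v', (1, 4))]
print(find_patterns_for_head(tokens, head_idx=3))  # [('V about n', (3, 6))]
```
### 4. Align grammar patterns for parallel sentences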
```python
parallel_pats = align_parallel_pats(src_pats, tgt_pats)
```
```python
print(parallel_pats)
[[('LIKE', 'V to v', 'liked to discuss', (1, 3)),
('LIKE', 'V to v', 'likes to discuss', (1, 3))],
[('DISCUSS', 'V about n', 'discuss about the issues', (3, 5)),
('DISCUSS', 'V n', 'discuss the issues', (3, 4))]]
```
`align_parallel_pats()` returns a list of aligned grammar pattern pairs.
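The example output above suggests that patterns are paired by shared headword. Below is a minimal sketch of that idea; this is an assumption about the behavior, not the repo's actual implementation (see `modules/grampat.py` for the real logic):
```python
# Minimal sketch, assuming alignment pairs source/target patterns that
# share a headword (an assumption, not the repo's actual algorithm).
def align_by_headword(src_pats, tgt_pats):
    tgt_by_head = {pat[0]: pat for pat in tgt_pats}
    return [[src, tgt_by_head[src[0]]]
            for src in src_pats if src[0] in tgt_by_head]

print(align_by_headword(src_pats, tgt_pats))
# Matches parallel_pats for the example above.
```
## What's Next?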
Having walked through the *Example Usages* above, you can use these modules to count grammar patterns over a large English monolingual corpus (BNC) and parallel grammatical error correction corpora (EFCAMDAT, LANG-8, CLC-FCE). We release a Python script for this (with multiprocessing support):
```sh
$ python compute_grampat.py \
-in_src_path data/src.tree.txt \
-in_tgt_path data/tgt.tree.txt \
-out_path data \
-out_prefix dataset_name \
-n_jobs 4 \
-batch_size 1024
```
The output file `data/dataset_name.grampat.dill` is a Python dictionary containing two keys:
- `"count_dict"` (3-nested dict):
- key1: source grammar pattern (str)
- key2: target grammar pattern (str)
- key3: headword in uppercase (str)
- value: count
- Note: We also save instances where the source grammar pattern is the same as the target grammar pattern.
- `"ngram_dict"` (4-nested dict):
- key1: source grammar pattern (str)
- key2: target grammar pattern (str)
- key3: headword in uppercase (str)
- key4: (source ngram, target ngram) (tuple)
- value: count

We released grammar pattern results for [BNC, EFCAMDAT, LANG-8 and CLC-FCE](https://goo.gl/aKR7Hr). They can be used for grammatical analysis (see `query_grampat.py` for example usage); a short loading sketch follows.
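A short sketch of loading and querying that file, assuming the nested layout described above; the specific pattern and headword keys in the lookups are illustrative:
```python
# Minimal sketch: load the dumped dictionary with dill and query it.
import dill

with open("data/dataset_name.grampat.dill", "rb") as f:
    grampat = dill.load(f)

# count_dict[src_pattern][tgt_pattern][headword] -> count
print(grampat["count_dict"]["V about n"]["V n"]["DISCUSS"])

# ngram_dict[src_pattern][tgt_pattern][headword][(src_ngram, tgt_ngram)] -> count
for ngrams, count in grampat["ngram_dict"]["V about n"]["V n"]["DISCUSS"].items():
    print(ngrams, count)
```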
## Citation
If you find the repo helpful for your research, you can cite it with the following BibTeX:
```
@software{yi_chen_howard_lo_2020_3611412,
author = {Yi-Chen Howard Lo},
title = {howardyclo/grammar-pattern},
month = jan,
year = 2020,
publisher = {Zenodo},
version = {v1.0.0},
doi = {10.5281/zenodo.3611412},
url = {https://doi.org/10.5281/zenodo.3611412}
}
```
Alternatively, click the badge [![DOI](https://zenodo.org/badge/136430580.svg)](https://zenodo.org/badge/latestdoi/136430580) and export the citation in any format you like (on the right-hand side of the Zenodo page).