https://github.com/howardyclo/grammar-pattern

Extract and align grammar patterns from English sentences.
https://github.com/howardyclo/grammar-pattern

chunking grammar grammar-parser grammar-pattern grammar-rules shallow-parser

Host: GitHub
URL: https://github.com/howardyclo/grammar-pattern
Owner: howardyclo
Created: 2018-06-07T06:13:16.000Z (over 6 years ago)
Default Branch: master
Last Pushed: 2022-12-08T02:14:41.000Z (almost 2 years ago)
Last Synced: 2023-03-04T03:54:29.181Z (over 1 year ago)
Topics: chunking, grammar, grammar-parser, grammar-pattern, grammar-rules, shallow-parser
Language: Python
Homepage:
Size: 128 KB
Stars: 45
Watchers: 6
Forks: 10
Open Issues: 7
Metadata Files:
- Readme: README.md

# grammar-pattern

This repo offers several Python (3.x) modules for grammatical analysis:

1. Extracting grammar patterns from sentences. For example, the grammar pattern for **"discuss"** in the sentence **"He likes to discuss the issues ."** would be **"V n"**.

2. Aligning grammar patterns from parallel sentences. For example, given the grammatically erroneous source sentence **"He likes to discuss about the issues ."** and the grammatically correct target sentence **"He likes to discuss the issues ."**, the aligned grammar pattern for **"discuss"** would be **"V about n" → "V n"**.

We currently support grammar patterns for verb, noun and adjective headwords. See [Wikipedia](https://en.wikipedia.org/wiki/Pattern_grammar) for an explanation of what a grammar pattern is.

## Setup

Before using the modules, install the Python dependencies (mainly [spaCy](https://spacy.io/) and [NLTK](https://www.nltk.org/)):

```sh
$ pip install -r requirements.txt
$ python -m spacy download en_core_web_lg
```

Run `test.py` to check whether any required modules or data are missing.

```sh
$ python test.py
```

## Example Usages

Here we demonstrate how to test our shallow parser, extract grammar patterns from a sentence, and align grammar patterns for parallel sentences.

### 0. Preprocess the sentences (See [How to use AllenNLP Constituency Tree Parser](how-to-use-allennlp-constituency-tree-parser/README.md))

As a pre-processing step, run an existing constituency tree parser to get a linearized constituency tree string for every sentence. The constituency tree parser we use is [AllenNLP](https://github.com/allenai/allennlp). They also have an [online demo](http://demo.allennlp.org/constituency-parsing).

![Alt text](imgs/1.png)
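
To give a feel for what a linearized tree string encodes, here is a minimal sketch that pulls the `(POS word)` leaf pairs out of such a string with a regular expression. It is only an illustration and not how the modules in this repo read the tree; the helper name `leaf_pairs` is made up for this example.

```python
import re

# Matches innermost "(TAG token)" pairs, i.e. the leaves of the tree.
LEAF = re.compile(r"\(([^\s()]+) ([^\s()]+)\)")


def leaf_pairs(tree_string):
    """Return (POS tag, word) pairs in sentence order (hypothetical helper)."""
    return LEAF.findall(tree_string)


tree = (
    "(S (NP (PRP He)) (VP (VBD liked) (S (VP (TO to) (VP (VB discuss) "
    "(PP (IN about) (NP (DT the) (NNS issues))))))) (. .))"
)

pairs = leaf_pairs(tree)
print(pairs)
print(" ".join(word for _, word in pairs))  # He liked to discuss about the issues .
```

The phrase-level brackets (`NP`, `VP`, `PP`) are ignored here; they are what a chunker would need in addition to the leaves.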

### 1. Import modules

```python
from modules.shallow_parser import shallow_parse
from modules.grampat import sent_to_pats, align_parallel_pats
```

### 2. Get shallow parsed results from sentences

```python
# source sentence: "He liked to discuss about the issues ."
# target sentence: "He likes to discuss the issues ."
# Note that we parse sentences in advance using AllenNLP's constituency tree parser.
src_parsed = shallow_parse("(S (NP (PRP He)) (VP (VBD liked) (S (VP (TO to) (VP (VB discuss) (PP (IN about) (NP (DT the) (NNS issues))))))) (. .))")
tgt_parsed = shallow_parse("(S (NP (PRP He)) (VP (VBZ likes) (S (VP (TO to) (VP (VB discuss) (NP (DT the) (NNS issues)))))) (. .))")
```

```python
print(src_parsed)
[[['He'], ['liked'], ['to'], ['discuss'], ['about'], ['the', 'issues'], ['.']],
 [['he'], ['like'], ['to'], ['discuss'], ['about'], ['the', 'issue'], ['.']],
 [['PRP'], ['VBD'], ['TO'], ['VB'], ['IN'], ['DT', 'NNS'], ['.']],
 [['H-NP'], ['H-VP'], ['H-VP'], ['H-VP'], ['H-PP'], ['I-NP', 'H-NP'], ['O']]]
```

```python
print(tgt_parsed)
[[['He'], ['likes'], ['to'], ['discuss'], ['the', 'issues'], ['.']],
 [['he'], ['like'], ['to'], ['discuss'], ['the', 'issue'], ['.']],
 [['PRP'], ['VBZ'], ['TO'], ['VB'], ['DT', 'NNS'], ['.']],
 [['H-NP'], ['H-VP'], ['H-VP'], ['H-VP'], ['I-NP', 'H-NP'], ['O']]]
```

`shallow_parse()` returns four parallel lists, each split into the same chunks:

- Original words

- Base forms of the original words (lemmas)

- POS tags from the constituency tree string

- Chunk tags

The `H`, `I` or `O` prefix of a chunk tag means:

- `H`: Headword of a chunk. This is the headword of a grammar pattern we're interested in. We simply **select the last word of a chunk as our headword**.

- `I`: Non-headword of a chunk.

- `O`: Outside of a chunk. This is often a punctuation word and not important in our case.
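
The tagging rule above can be sketched in a few lines. The two helpers below, `hio_tags` and `chunk_heads`, are hypothetical names written for this illustration and are not part of the repo's modules; the data is the `src_parsed` output shown earlier.

```python
def hio_tags(chunk_length, chunk_type):
    """Tag every word 'I-<type>' except the last one, which becomes the head."""
    return [f"I-{chunk_type}"] * (chunk_length - 1) + [f"H-{chunk_type}"]


def chunk_heads(parsed):
    """Collect (lemma, chunk type) for every headword in a parsed sentence."""
    _words, lemmas, _pos, tags = parsed
    heads = []
    for lemma_chunk, tag_chunk in zip(lemmas, tags):
        for lemma, tag in zip(lemma_chunk, tag_chunk):
            if tag.startswith("H-"):
                heads.append((lemma, tag[2:]))
    return heads


# Copied from the shallow_parse() output shown above.
src_parsed = [
    [["He"], ["liked"], ["to"], ["discuss"], ["about"], ["the", "issues"], ["."]],
    [["he"], ["like"], ["to"], ["discuss"], ["about"], ["the", "issue"], ["."]],
    [["PRP"], ["VBD"], ["TO"], ["VB"], ["IN"], ["DT", "NNS"], ["."]],
    [["H-NP"], ["H-VP"], ["H-VP"], ["H-VP"], ["H-PP"], ["I-NP", "H-NP"], ["O"]],
]

print(hio_tags(2, "NP"))        # ['I-NP', 'H-NP']
print(chunk_heads(src_parsed))
```

Because the four lists are parallel, zipping them chunk by chunk is enough to line up a word with its lemma, POS tag and chunk tag.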

### 3. Extract grammar patterns from sentences

```python
src_pats = sent_to_pats(src_parsed)
tgt_pats = sent_to_pats(tgt_parsed)
```

```python
print(src_pats)
[('LIKE', 'V to v', 'liked to discuss', (1, 3)),
 ('DISCUSS', 'V about n', 'discuss about the issues', (3, 5))]
```

```python
print(tgt_pats)
[('LIKE', 'V to v', 'likes to discuss', (1, 3)),
 ('DISCUSS', 'V n', 'discuss the issues', (3, 4))]
```

`sent_to_pats()` returns a list of tuples. Each tuple contains:

- Headword

- Grammar pattern (the POS symbol in uppercase corresponds to the headword)

- N-gram that matches the grammar pattern

- Start and end positions of the n-gram in the chunked sentence
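
As a small usage sketch, the positions can be mapped back onto the chunked words. Judging from the output above, the positions index chunks (not individual words) and the end position is inclusive; both points are read off the example rather than taken from the code. `ngram_from_span` is a hypothetical helper.

```python
def ngram_from_span(chunked_words, span):
    """Join the words of all chunks from start to end (end assumed inclusive)."""
    start, end = span
    return " ".join(word for chunk in chunked_words[start:end + 1] for word in chunk)


# First list of the shallow_parse() output and the sent_to_pats() output shown above.
chunked_words = [["He"], ["liked"], ["to"], ["discuss"], ["about"], ["the", "issues"], ["."]]
src_pats = [
    ("LIKE", "V to v", "liked to discuss", (1, 3)),
    ("DISCUSS", "V about n", "discuss about the issues", (3, 5)),
]

recovered = [ngram_from_span(chunked_words, span) for _, _, _, span in src_pats]
for (headword, pattern, ngram, span), text in zip(src_pats, recovered):
    print(f"{headword:8} {pattern:10} chunks {span} -> {text!r}")
```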

How `sent_to_pats()` works:

- Generate a list of n-grams from the parsed results.

- For every n-gram, check whether it matches one of the **hand-selected** grammar patterns (listed in `grampat.py`).

- The grammar patterns are selected in advance from [*Collins COBUILD Grammar Patterns I: Verb*](http://arts-ccr-002.bham.ac.uk/ccr/patgram/) and [*Grammar Patterns II: Nouns and Adjectives*](https://www.amazon.com/Grammar-Patterns-II-Adjectives-COBUILD/dp/0003750671), which are annotated by experts. We believe those grammar patterns are generally good and cover most grammar patterns used in English.

- Note that it is possible to automatically find good grammar patterns from large monolingual corpora by counting the frequencies of various POS-tag n-grams and selecting the good ones by frequency. A grammar pattern can be roughly interpreted as a simplified POS-tag n-gram.
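
The n-gram matching idea can be sketched as follows. This is a deliberately simplified illustration and not the implementation in `grampat.py`: the three-entry `PATTERNS` set, the chunk-to-symbol mapping in `chunk_symbol`, the `max_len` limit and the function name `sketch_sent_to_pats` are all assumptions made for this example.

```python
# Hypothetical, tiny pattern inventory; the real list lives in grampat.py.
PATTERNS = {"V n", "V about n", "V to v"}


def chunk_symbol(lemma_chunk, pos_chunk, tag_chunk, is_head):
    """Map one chunk to a pattern symbol (assumed, simplified mapping)."""
    head_pos, head_tag = pos_chunk[-1], tag_chunk[-1]
    if head_pos in {"IN", "TO"}:
        return lemma_chunk[-1]          # prepositions and "to" appear literally
    if head_tag.endswith("VP"):
        return "V" if is_head else "v"  # uppercase marks the pattern's headword
    if head_tag.endswith("NP"):
        return "N" if is_head else "n"
    return None                         # chunk type not covered by this sketch


def sketch_sent_to_pats(parsed, max_len=3):
    """Enumerate chunk n-grams and keep those whose symbols form a known pattern."""
    words, lemmas, pos, tags = parsed
    found = []
    for start in range(len(words)):
        for end in range(start + 1, min(start + max_len, len(words))):
            span = range(start, end + 1)
            symbols = [chunk_symbol(lemmas[i], pos[i], tags[i], i == start) for i in span]
            if None in symbols or " ".join(symbols) not in PATTERNS:
                continue
            ngram = " ".join(w for i in span for w in words[i])
            found.append((lemmas[start][-1].upper(), " ".join(symbols), ngram, (start, end)))
    return found


src_parsed = [
    [["He"], ["liked"], ["to"], ["discuss"], ["about"], ["the", "issues"], ["."]],
    [["he"], ["like"], ["to"], ["discuss"], ["about"], ["the", "issue"], ["."]],
    [["PRP"], ["VBD"], ["TO"], ["VB"], ["IN"], ["DT", "NNS"], ["."]],
    [["H-NP"], ["H-VP"], ["H-VP"], ["H-VP"], ["H-PP"], ["I-NP", "H-NP"], ["O"]],
]
print(sketch_sent_to_pats(src_parsed))
```

With this toy inventory the sketch reproduces the two source patterns shown earlier; a realistic inventory would need rules for overlapping patterns and for adjective and noun headwords, which are left out here.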

### 4. Align grammar patterns for parallel sentences

```python
parallel_pats = align_parallel_pats(src_pats, tgt_pats)
```

```python
print(parallel_pats)
[[('LIKE', 'V to v', 'liked to discuss', (1, 3)),
  ('LIKE', 'V to v', 'likes to discuss', (1, 3))],
 [('DISCUSS', 'V about n', 'discuss about the issues', (3, 5)),
  ('DISCUSS', 'V n', 'discuss the issues', (3, 4))]]
```

`align_parallel_pats()` returns a list of aligned pairs. Each pair holds the source pattern tuple followed by its target counterpart.
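
For error analysis, the interesting pairs are usually those where the pattern changed between source and target. A minimal sketch of that filter, using the output above as literal data (`changed_patterns` is a hypothetical helper, not part of the repo):

```python
# Copied from the align_parallel_pats() output shown above.
parallel_pats = [
    [("LIKE", "V to v", "liked to discuss", (1, 3)),
     ("LIKE", "V to v", "likes to discuss", (1, 3))],
    [("DISCUSS", "V about n", "discuss about the issues", (3, 5)),
     ("DISCUSS", "V n", "discuss the issues", (3, 4))],
]


def changed_patterns(pairs):
    """Return (headword, source pattern, target pattern) where the pattern differs."""
    changes = []
    for src, tgt in pairs:
        headword, src_pat, _src_ngram, _src_span = src
        _tgt_head, tgt_pat, _tgt_ngram, _tgt_span = tgt
        if src_pat != tgt_pat:
            changes.append((headword, src_pat, tgt_pat))
    return changes


print(changed_patterns(parallel_pats))  # [('DISCUSS', 'V about n', 'V n')]
```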

## What's Next?

Now that you've completed the *Example Usages* guide, you can use these modules to count grammar patterns in large English monolingual corpora (BNC) and parallel grammatical error correction corpora (EFCAMDAT, LANG-8, CLC-FCE). We released a Python script for doing this (it supports multiprocessing):

```sh
$ python compute_grampat.py \
-in_src_path data/src.tree.txt \
-in_tgt_path data/tgt.tree.txt \
-out_path data \
-out_prefix dataset_name \
-n_jobs 4 \
-batch_size 1024
```

The output file `data/dataset_name.grampat.dill` contains a Python dictionary with two keys:

- `"count_dict"` (3-nested dict):

    - key1: source grammar pattern (str)

    - key2: target grammar pattern (str)

    - key3: headword in uppercase (str)

    - value: count

    - Note: We also save the instances where the source grammar pattern is the same as the target grammar pattern.

- `"ngram_dict"` (4-nested dict):

    - key1: source grammar pattern (str)

    - key2: target grammar pattern (str)

    - key3: headword in uppercase (str)

    - key4: (source ngram, target ngram) (tuple)

    - value: count 
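
To make the nesting concrete, the sketch below builds two small in-memory dictionaries with the layout described above and queries them. It only mirrors the documented key order; the helper names (`make_count_dict`, `make_ngram_dict`, `add_pair`, `top_targets`) and the toy counts are assumptions for illustration, not the code in `compute_grampat.py`.

```python
from collections import defaultdict

# With the released file, the two dicts would presumably come from something like:
#   import dill
#   with open("data/dataset_name.grampat.dill", "rb") as f:
#       data = dill.load(f)
#   count_dict, ngram_dict = data["count_dict"], data["ngram_dict"]


def make_count_dict():
    """source pattern -> target pattern -> headword -> count"""
    return defaultdict(lambda: defaultdict(lambda: defaultdict(int)))


def make_ngram_dict():
    """source pattern -> target pattern -> headword -> (src ngram, tgt ngram) -> count"""
    return defaultdict(lambda: defaultdict(lambda: defaultdict(lambda: defaultdict(int))))


def add_pair(count_dict, ngram_dict, src, tgt):
    """Record one aligned (source tuple, target tuple) pair in both dictionaries."""
    headword, src_pat, src_ngram, _ = src
    _, tgt_pat, tgt_ngram, _ = tgt
    count_dict[src_pat][tgt_pat][headword] += 1
    ngram_dict[src_pat][tgt_pat][headword][(src_ngram, tgt_ngram)] += 1


def top_targets(count_dict, src_pat, headword):
    """Target patterns observed for a headword and source pattern, most frequent first."""
    by_target = count_dict.get(src_pat, {})
    totals = {tgt: heads[headword] for tgt, heads in by_target.items() if headword in heads}
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


count_dict, ngram_dict = make_count_dict(), make_ngram_dict()
discuss_src = ("DISCUSS", "V about n", "discuss about the issues", (3, 5))
discuss_tgt = ("DISCUSS", "V n", "discuss the issues", (3, 4))
for _ in range(2):  # toy data: pretend the same correction was seen twice
    add_pair(count_dict, ngram_dict, discuss_src, discuss_tgt)

print(top_targets(count_dict, "V about n", "DISCUSS"))  # [('V n', 2)]
print(dict(ngram_dict["V about n"]["V n"]["DISCUSS"]))
```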

We released grammar pattern results for [BNC, EFCAMDAT, LANG-8 and CLC-FCE](https://goo.gl/aKR7Hr). They can be used for grammatical analysis (see `query_grampat.py` for example usage).

## Citation

If you find the repo helpful for your research, you can cite it with the following BibTeX:

```
@software{yi_chen_howard_lo_2020_3611412,
  author       = {Yi-Chen Howard Lo},
  title        = {howardyclo/grammar-pattern},
  month        = jan,
  year         = 2020,
  publisher    = {Zenodo},
  version      = {v1.0.0},
  doi          = {10.5281/zenodo.3611412},
  url          = {https://doi.org/10.5281/zenodo.3611412}
}
```

Alternatively, click this badge [![DOI](https://zenodo.org/badge/136430580.svg)](https://zenodo.org/badge/latestdoi/136430580) and export the citation in any format you like (the export options are on the right-hand side of the page).