https://github.com/plandes/nlparse

Natural language processing parsing and tool library

Host: GitHub
URL: https://github.com/plandes/nlparse
Owner: plandes
License: other
Created: 2019-07-07T23:06:57.000Z (over 6 years ago)
Default Branch: master
Last Pushed: 2024-04-09T19:20:27.000Z (over 1 year ago)
Last Synced: 2024-04-09T23:54:35.036Z (over 1 year ago)
Topics: natural-language-processing, nlp-machine-learning, pypi-badge, pypi-link, spacy, spacy-nlp
Language: Python
Homepage: https://plandes.github.io/nlparse/
Size: 1.09 MB
Stars: 5
Watchers: 4
Forks: 2
Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Contributing: CONTRIBUTING.md
- License: LICENSE.md
- Citation: CITATION.cff

# Zensols Natural Language Parsing

[![PyPI][pypi-badge]][pypi-link]
[![Python 3.11][python311-badge]][python311-link]
[![Python 3.12][python312-badge]][python312-link]
[![Build Status][build-badge]][build-link]

From the paper [DeepZensols: A Deep Learning Natural Language Processing
Framework for Experimentation and Reproducibility].  This framework wraps the
[spaCy] framework and creates lightweight features in a class [hierarchy] that
reflects the structure of natural language.  The motivation is to generate
features from the parsed text in an object-oriented fashion that is fast and
easy to pickle.

Other features include:

* [Parse and normalize] a stream of tokens with stop word and punctuation
  filters, up/down casing, Porter stemming and [others].
* [Detached features] that are safe and easy to pickle to disk.
* Configuration-driven parsing and token normalization using [configuration
  factories].
* Pretty print functionality for easy natural language feature selection.
* A comprehensive [scoring module] including the following scoring methods:
  * [Rouge]
  * [Bleu]
  * [SemEval-2013 Task 9.1]
  * [Levenshtein distance]
  * Exact match
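
To make the normalization steps in the first bullet concrete, here is a minimal
plain Python sketch of a filter-and-map token pipeline.  It does not use the
`zensols.nlp` API: the `STOP_WORDS` set and the `normalize` function are
hypothetical names made up for this illustration, and the library's own
normalizers are configured as described in the linked documentation.

```python
import string
from typing import Iterable, List

# Hypothetical, tiny stop word list used only for this illustration.
STOP_WORDS = {'the', 'a', 'an', 'of', 'was'}


def normalize(tokens: Iterable[str]) -> List[str]:
    """Lower case tokens, then drop stop words and punctuation."""
    normed: List[str] = []
    for tok in tokens:
        tok = tok.lower()
        # skip tokens made up only of punctuation characters
        if all(c in string.punctuation for c in tok):
            continue
        # skip tokens found in the stop word list
        if tok in STOP_WORDS:
            continue
        normed.append(tok)
    return normed


sent = 'Obama was the 44th president of the United States .'
print(normalize(sent.split()))
```

Each step is either a filter (drop the token) or a mapper (rewrite the token),
which is the same split the normalizer documentation uses.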

## Documentation

* [Framework documentation]
* [Natural Language Parsing]
* [List Token Normalizers and Mappers]

## Obtaining / Installing

The library can be installed with pip from the [pypi] repository:

```bash
pip3 install zensols.nlp
```

The smallest base spaCy model will automatically be downloaded on the first
use.  You can download other models, such as the medium base model, using the
following command:

```bash
python -m spacy download en_core_web_md
```
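
Because spaCy models are installed as regular Python packages, you can check
which ones are already present before downloading.  The following is a small
sketch using only the standard library; the `is_installed` helper is a
hypothetical name and is not part of `zensols.nlp` or spaCy.

```python
from importlib.util import find_spec


def is_installed(package: str) -> bool:
    """Return whether a top-level package can be found in this environment."""
    return find_spec(package) is not None


# report the small and medium English models named in this section
for model in ('en_core_web_sm', 'en_core_web_md'):
    status = 'installed' if is_installed(model) else 'missing'
    print(f'{model}: {status}')
```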

## Usage

A parser using the default configuration can be obtained by:

```python
from zensols.nlp import FeatureDocumentParser

# create a parser with the default configuration and parse a sentence
parser: FeatureDocumentParser = FeatureDocumentParser.default_instance()
doc = parser('Obama was the 44th president of the United States.')
# print the normalized text, part of speech and tag of each token
for tok in doc.tokens:
    print(tok.norm, tok.pos_, tok.tag_)
print(doc.entities)

# output:
# Obama PROPN NNP
# was AUX VBD
# the DET DT
# 44th ADJ JJ
# president NOUN NN
# of ADP IN
# the United States DET DT
# . PUNCT .
# (<Obama>, <44th>, <the United States>)
```

However, minimal effort is needed to configure the parser using a [resource library]:

```python
from io import StringIO
from zensols.config import ImportIniConfig, ImportConfigFactory
from zensols.nlp import FeatureDocument, FeatureDocumentParser

CONFIG = """
# import the `zensols.nlp` library
[import]
config_file = resource(zensols.nlp): resources/obj.conf

# override the parser to keep only the `ent_` and `tag_` token features
[doc_parser]
token_feature_ids = set: ent_, tag_
"""

if (__name__ == '__main__'):
    # create the parser from the in-memory configuration
    fac = ImportConfigFactory(ImportIniConfig(StringIO(CONFIG)))
    doc_parser: FeatureDocumentParser = fac('doc_parser')
    sent = 'He was George Washington and first president of the United States.'
    doc: FeatureDocument = doc_parser(sent)
    # pretty print the features of each token
    for tok in doc.tokens:
        tok.write()
```

This uses a [resource library] to source in the configuration from this package
so minimal configuration is necessary.  More advanced configuration [examples]
are also available.

See the [feature documents] for more information.
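
The configuration string above is ordinary INI text, so its structure can be
inspected with the standard library.  The sketch below is only an illustration
of the file layout: splitting the `set:` prefix by hand is an assumption made
here for demonstration, and the real parsing of `resource(...)` and `set:`
values is done by `zensols.config`.

```python
from configparser import ConfigParser

CONFIG = """
[import]
config_file = resource(zensols.nlp): resources/obj.conf

[doc_parser]
token_feature_ids = set: ent_, tag_
"""

parser = ConfigParser()
parser.read_string(CONFIG)
raw = parser['doc_parser']['token_feature_ids']
# split off the `set:` prefix by hand (illustration only)
prefix, _, items = raw.partition(':')
feature_ids = {item.strip() for item in items.split(',')}
print(parser.sections())
print(prefix, sorted(feature_ids))
```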

## Scoring

Certain scores in the [scoring module] need additional Python packages.  These
are installed with:

```bash
pip install -r src/python/requirements-score.txt
```
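
For readers unfamiliar with the simpler metrics listed earlier, the following
is a self-contained sketch of [Levenshtein distance] and exact match written
with the standard library only.  It is not the scoring module's
implementation or API; the `levenshtein` and `exact_match` function names are
hypothetical and chosen for this example.

```python
from typing import Sequence


def levenshtein(a: Sequence, b: Sequence) -> int:
    """Return the minimum number of single element edits turning a into b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j]
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def exact_match(a: str, b: str) -> bool:
    """Return whether the two strings are identical."""
    return a == b


print(levenshtein('kitten', 'sitting'))
print(exact_match('Obama', 'Obama'))
```

Because `levenshtein` accepts any sequence, it works on lists of tokens as well
as on strings of characters.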

## Attribution

This project, or example code, uses:

* [spaCy] for natural language parsing

* [msgpack] and [smart-open] for Python disk serialization

* [nltk] for the [porter stemmer] functionality

## Citation

If you use this project in your research, please use the following BibTeX entry:

```bibtex
@inproceedings{landes-etal-2023-deepzensols,
    title = "{D}eep{Z}ensols: A Deep Learning Natural Language Processing Framework for Experimentation and Reproducibility",
    author = "Landes, Paul  and
      Di Eugenio, Barbara  and
      Caragea, Cornelia",
    editor = "Tan, Liling  and
      Milajevs, Dmitrijs  and
      Chauhan, Geeticka  and
      Gwinnup, Jeremy  and
      Rippeth, Elijah",
    booktitle = "Proceedings of the 3rd Workshop for Natural Language Processing Open Source Software (NLP-OSS 2023)",
    month = dec,
    year = "2023",
    address = "Singapore, Singapore",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.nlposs-1.16",
    pages = "141--146"
}
```

## Changelog

An extensive changelog is available [here](CHANGELOG.md).

## Community

Please star this repository and let me know how and where you use this API.
Contributions as pull requests, feedback and any input are welcome.

## License

[MIT License](LICENSE.md)

Copyright (c) 2020 - 2025 Paul Landes

[pypi]: https://pypi.org/project/zensols.nlp/

[pypi-link]: https://pypi.python.org/pypi/zensols.nlp

[pypi-badge]: https://img.shields.io/pypi/v/zensols.nlp.svg

[python311-badge]: https://img.shields.io/badge/python-3.11-blue.svg

[python311-link]: https://www.python.org/downloads/release/python-3110

[python312-badge]: https://img.shields.io/badge/python-3.12-blue.svg

[python312-link]: https://www.python.org/downloads/release/python-3120

[build-badge]: https://github.com/plandes/nlparse/workflows/CI/badge.svg

[build-link]: https://github.com/plandes/nlparse/actions

[DeepZensols: A Deep Learning Natural Language Processing Framework for Experimentation and Reproducibility]: https://aclanthology.org/2023.nlposs-1.16.pdf

[examples]: https://github.com/plandes/nlparse/tree/master/example/config

[hierarchy]: https://plandes.github.io/nlparse/api/zensols.nlp.html#zensols.nlp.container.FeatureDocument

[Parse and normalize]: https://plandes.github.io/nlparse/doc/parse.html

[others]: https://plandes.github.io/nlparse/doc/normalizers.html

[Detached features]: https://plandes.github.io/nlparse/doc/parse.html#detached-features

[full documentation]: https://plandes.github.io/nlparse/

[Framework documentation]: https://plandes.github.io/nlparse/api.html

[Natural Language Parsing]: https://plandes.github.io/nlparse/doc/parse.html

[List Token Normalizers and Mappers]: https://plandes.github.io/nlparse/doc/normalizers.html

[resource library]: https://plandes.github.io/util/doc/config.html#resource-libraries

[spaCy]: https://spacy.io

[nltk]: https://www.nltk.org

[smart-open]: https://pypi.org/project/smart-open/

[msgpack]: https://msgpack.org

[porter stemmer]: https://tartarus.org/martin/PorterStemmer/

[configuration factories]: https://plandes.github.io/util/doc/config.html#configuration-factory

[feature documents]: https://plandes.github.io/nlparse/doc/feature-doc.html

[scoring module]: https://plandes.github.io/nlparse/api/zensols.nlp.html#zensols-nlp-score

[Rouge]: https://aclanthology.org/W04-1013

[Bleu]: https://aclanthology.org/P02-1040

[SemEval-2013 Task 9.1]: https://web.archive.org/web/20150131105418/https://www.cs.york.ac.uk/semeval-2013/task9/data/uploads/semeval_2013-task-9_1-evaluation-metrics.pdf

[Levenshtein distance]: https://en.wikipedia.org/wiki/Levenshtein_distance
