# Natural Language Processing Library for Pharo Smalltalk

Copyright 2005 to 2021 by Mark Watson

License: MIT

Note: the most recent updates to this Pharo Smalltalk package appear on the [GitHub repo for this project](https://github.com/mark-watson/nlp_smalltalk).

Note 2: on 4/25/2021 I converted this project to use the Iceberg GitHub support for Pharo Smalltalk. All source code and data have been moved to the subdirectory **src**.

Iceberg/GitHub documentation: [https://books.pharo.org/booklet-ManageCode/pdf/2019-03-24-ManageCode.pdf](https://books.pharo.org/booklet-ManageCode/pdf/2019-03-24-ManageCode.pdf)

Add this repository using the Iceberg Browser.

## One-time setup after loading the code via Iceberg

### Part Of Speech Tagging

Open a File Browser and fileIn the KBSnlp.st source file. Open a Class Browser
and look at the code in the KBnlp class.

Open a Workspace and evaluate the following one time only:

NLPtagger initializeLexicon

Try tagging a sentence to make sure the data was read from disk correctly:

NLPtagger pptag: 'The dog ran down the street'

If this does not work, the nlp_smalltalk directory is probably not in the default directory. The code containing the file path is:

read := (FileStream fileNamed: './nlp_smalltalk/lexicon.txt') readOnly.
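To see whether the path problem described above applies, something like the following can be evaluated in a Workspace. This is only a sketch and not part of the library: it assumes a Pharo image with the FileSystem classes (`asFileReference`, `FileSystem workingDirectory`), and that relative paths resolve against the same directory `FileStream` uses, which may differ between Pharo versions.

```smalltalk
"Hypothetical check, not library code: is the lexicon where the tagger expects it?"
| lexicon report |
lexicon := './nlp_smalltalk/lexicon.txt' asFileReference.
report := lexicon exists
    ifTrue: [ 'Found lexicon at ' , lexicon fullName ]
    ifFalse: [ 'Lexicon missing; working directory is ' , FileSystem workingDirectory fullName ].
report
```

If the file is reported missing, moving or linking the nlp_smalltalk directory into the reported working directory is one way to make the relative path resolve.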

### Categorization

I am using NeoJSON to parse the category word count data, so make sure NeoJSON is installed. NeoJSON can be installed using:

Gofer it
smalltalkhubUser: 'SvenVanCaekenberghe' project: 'Neo';
configurationOf: 'NeoJSON';
loadStable.
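If the SmalltalkHub load above is not available in your image, NeoJSON can also be loaded from its GitHub repository with Metacello. This alternative is not from the original instructions: the repository location is the one given in the NeoJSON project's own README, and it assumes network access from the image.

```smalltalk
"Alternative install (an assumption, not the author's documented method)."
Metacello new
    repository: 'github://svenvc/NeoJSON/repository';
    baseline: 'NeoJSON';
    load.
```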

One time initialization:

NLPcategories initializeCategoryHash

Try it:

NLPcategories classify: 'The economy is bad and taxes are too high.'

### Entity Recognition

Implemented for products, companies, places, and people's names.

One time initialization:

NLPentities initializeEntities

Example:

NLPentities entities: 'The Coca Cola factory is in London'

--> a Dictionary('companies'->a Set('Coca Cola') 'places'->a Set('London') 'products'->a Set('Coca Cola') )

NLPentities humanNameHelper: 'John Alex Smith and Andy Jones went to the store.'

--> a Set('John Alex Smith' 'Andy Jones')

### Sentence Segmentation

One time initialization:

NLPsentences loadData

NLPsentences sentences: 'Today Mr. Jones went to town. He bought gas.'

--> an OrderedCollection(an OrderedCollection('Today' 'Mr.' 'Jones' 'went' 'to' 'town' '.') an OrderedCollection('He' 'bought' 'gas' '.'))

### Summarization

No additional data needs to be loaded for summarization, but all other data should be loaded as per the above directions. Here is a short example:

NLPsummarizer summarize: 'The administration and House Republicans have asked a federal appeals court for a 90-day extension in a case that involves federal payments to reduce deductibles and copayments for people with modest incomes who buy their own policies. The fate of $7 billion in "cost-sharing subsidies" remains under a cloud as insurers finalize their premium requests for next year. Experts say premiums could jump about 20 percent without the funding. In requesting the extension, lawyers for the Trump administration and the House said the parties are continuing to work on measures, including potential legislative action, to resolve the issue. Requests for extensions are usually granted routinely.'

--> #('The administration and House Republicans have asked a federal appeals court for a 90-day extension in a case that involves federal payments to reduce deductibles and copayments for people with modest incomes who buy their own policies .' 'The fate of $ 7 billion in "cost-sharing subsidies" remains under a cloud as insurers finalize their premium requests for next year .' 'In requesting the extension , lawyers for the Trump administration and the House said the parties are continuing to work on measures , including potential legislative action , to resolve the issue .')

## Limitations

- Does not currently handle special characters like: —
- Categorization and summarization should also use a "bag of n-grams" in addition to "bag of words" (BOW)
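
To illustrate what a "bag of n-grams" would add to bag of words, here is a minimal Workspace sketch that counts bigrams (n = 2) in a string. It is not code from this library: `bigramsOf` is a hypothetical local block, tokenization is plain whitespace splitting, and the library's own tokenizer may behave differently.

```smalltalk
"Illustrative only: bigramsOf is a local block, not a method of this library."
| bigramsOf counts |
bigramsOf := [ :text | | words |
    "Lowercase and split on whitespace, then pair each word with its successor."
    words := text asLowercase substrings.
    (1 to: words size - 1) collect: [ :i |
        (words at: i) , ' ' , (words at: i + 1) ] ].
counts := Bag new.
counts addAll: (bigramsOf value: 'taxes are too high and taxes are rising').
counts occurrencesOf: 'taxes are'
```

Here the bigram 'taxes are' occurs twice, so the last expression evaluates to 2. A Bag keeps the per-item counts, which is the same kind of data the word-count approach uses, only keyed by word pairs instead of single words.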