https://github.com/mark-watson/nlp_smalltalk
A natural language processing toolkit for Pharo Smalltalk
- Host: GitHub
- URL: https://github.com/mark-watson/nlp_smalltalk
- Owner: mark-watson
- License: other
- Created: 2012-07-10T01:47:03.000Z (over 12 years ago)
- Default Branch: master
- Last Pushed: 2023-02-21T23:57:54.000Z (over 1 year ago)
- Last Synced: 2024-07-05T21:07:04.817Z (4 months ago)
- Language: Smalltalk
- Size: 1.57 MB
- Stars: 61
- Watchers: 20
- Forks: 10
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Funding: .github/FUNDING.yml
- License: LICENSE.txt
Awesome Lists containing this project
- awesome-pharo-ml - nlp_smalltalk - natural language processing library. Implements part of speech tagging, categorization, named entity recognition, sentence segmentation, and summarization (Natural Language Processing)
README
# Natural Language Processing Library for Pharo Smalltalk
Copyright 2005 to 2021 by Mark Watson
License: MIT
Note: the most frequent updates to this Pharo Smalltalk package will appear on the [GitHub repo for this project](https://github.com/mark-watson/nlp_smalltalk).
Note 2: on 4/25/2021 I converted this project to use the IceBerg GitHub support for Pharo Smalltalk. All source code and data have been moved to the subdirectory **src**.
IceBerg/github documentation: [https://books.pharo.org/booklet-ManageCode/pdf/2019-03-24-ManageCode.pdf](https://books.pharo.org/booklet-ManageCode/pdf/2019-03-24-ManageCode.pdf)
Add this repository using the IceBerg Browser.
## Setup to be done one time after loading the code via IceBerg
### Part Of Speech Tagging
Open a File Browser and fileIn the KBSnlp.st source file. Open a Class Browser
and look at the code in the KBnlp class. Then open a Workspace and evaluate the following one time only:
NLPtagger initializeLexicon
Try tagging a sentence to make sure the data was read from disk correctly:
NLPtagger pptag: 'The dog ran down the street'
If this does not work, the nlp_smalltalk directory is probably not inside Pharo's default directory. The code containing the file path is:
read := (FileStream fileNamed: './nlp_smalltalk/lexicon.txt') readOnly.
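Before running the initialization, it can help to confirm that the lexicon file is where the code expects it. The following is a minimal sketch, assuming that the "default directory" corresponds to what Pharo's FileSystem API reports as the working directory; the variable name `lexiconFile` is only illustrative and is not part of this library:

```smalltalk
"Sketch only: check whether lexicon.txt is reachable from the working directory.
 Assumes the relative path used above resolves against FileSystem workingDirectory."
| lexiconFile |
lexiconFile := FileSystem workingDirectory / 'nlp_smalltalk' / 'lexicon.txt'.
lexiconFile exists
    ifTrue: [ Transcript show: 'Found lexicon at ' , lexiconFile fullName; cr ]
    ifFalse: [ Transcript show: 'Missing lexicon at ' , lexiconFile fullName; cr ]
```

If the Transcript reports the file as missing, the printed full path shows where the nlp_smalltalk directory would need to be placed.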
### Categorization
I am using NeoJSON to parse the category word count data so make sure NeoJSON is installed. NeoJSON can be installed using:
Gofer it
smalltalkhubUser: 'SvenVanCaekenberghe' project: 'Neo';
configurationOf: 'NeoJSON';
loadStable.
One time initialization:
NLPcategories initializeCategoryHash
Try it:
NLPcategories classify: 'The economy is bad and taxes are too high.'
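To illustrate the general idea of scoring text against category word counts parsed with NeoJSON, here is a small sketch. It is not the code inside NLPcategories: the JSON string, the counts, and the scoring rule (summing the counts of matching words) are all assumptions made for illustration. Only `NeoJSONReader fromString:` is taken from NeoJSON itself.

```smalltalk
"Sketch only: hypothetical word counts for a single category, parsed from JSON.
 The data and the scoring rule are illustrative assumptions."
| counts score |
counts := NeoJSONReader fromString: '{"economy": 3, "taxes": 2}'.
score := 0.
'the economy is bad and taxes are too high' asLowercase substrings do: [ :word |
    "Words without an entry in the dictionary contribute nothing."
    score := score + (counts at: word ifAbsent: [ 0 ]) ].
score
```

With these made-up counts, the words economy and taxes match and the score is 5. A classifier along these lines would compute one such score per category and pick the highest.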
### Entity Recognition
Implemented for products, companies, places, and people's names.
One time initialization:
NLPentities initializeEntities
Example:
NLPentities entities: 'The Coca Cola factory is in London'
--> a Dictionary('companies'->a Set('Coca Cola') 'places'->a Set('London') 'products'->a Set('Coca Cola') )
NLPentities humanNameHelper: 'John Alex Smith and Andy Jones went to the store.'
--> a Set('John Alex Smith' 'Andy Jones')
### Sentence Segmentation
One time initialization:
NLPsentences loadData
NLPsentences sentences: 'Today Mr. Jones went to town. He bought gas.'
--> an OrderedCollection(an OrderedCollection('Today' 'Mr.' 'Jones' 'went' 'to' 'town' '.') an OrderedCollection('He' 'bought' 'gas' '.'))
### Summarization
No additional data needs to be loaded for summarization, but all other data should be loaded as per the directions above. Here is a short example:
NLPsummarizer summarize: 'The administration and House Republicans have asked a federal appeals court for a 90-day extension in a case that involves federal payments to reduce deductibles and copayments for people with modest incomes who buy their own policies. The fate of $7 billion in "cost-sharing subsidies" remains under a cloud as insurers finalize their premium requests for next year. Experts say premiums could jump about 20 percent without the funding. In requesting the extension, lawyers for the Trump administration and the House said the parties are continuing to work on measures, including potential legislative action, to resolve the issue. Requests for extensions are usually granted routinely.'
--> #('The administration and House Republicans have asked a federal appeals court for a 90-day extension in a case that involves federal payments to reduce deductibles and copayments for people with modest incomes who buy their own policies .' 'The fate of $ 7 billion in "cost-sharing subsidies" remains under a cloud as insurers finalize their premium requests for next year .' 'In requesting the extension , lawyers for the Trump administration and the House said the parties are continuing to work on measures , including potential legislative action , to resolve the issue .')
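Since summarization depends on the data loaded in the earlier sections, the one-time initialization calls can be collected into a single Workspace snippet. Every message below is taken from the sections above; evaluating them together in this order is just a convenience and an assumption about load order, not a documented requirement:

```smalltalk
"One-time setup, gathered from the sections above.
 Assumes the data files are reachable from the default directory."
NLPtagger initializeLexicon.
NLPcategories initializeCategoryHash.
NLPentities initializeEntities.
NLPsentences loadData.
```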
## Limitations
- Does not currently handle special characters like: —
- Categorization and summarization should also use "bag of ngrams" in addition to "bag of words" (BOW)
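As a rough illustration of what a "bag of ngrams" could look like, the sketch below builds bigrams (2-grams) from a whitespace-tokenized string and counts them in a Bag. This is not part of the library: the sample text and the variable names are hypothetical, and real use would need the same tokenization as the rest of the toolkit.

```smalltalk
"Sketch only: count bigrams in a whitespace-tokenized string."
| words bigrams counts |
words := 'the economy is bad and the economy is slow' substrings.
bigrams := OrderedCollection new.
"Join each word with its successor to form a bigram."
1 to: words size - 1 do: [ :i |
    bigrams add: (words at: i) , ' ' , (words at: i + 1) ].
counts := bigrams asBag.
counts occurrencesOf: 'the economy'
```

Here the bigram 'the economy' occurs twice. Bigram counts like these could be stored and scored alongside the existing single-word counts.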