https://github.com/linguistic-dev/n-gram-extractor

A PHP Library to extract n-grams from a text. Simple preprocessing tools (cleaning, tokenizing) included.

# NGramExtractor for PHP

## Installation

Install via Composer:

```
composer require linguistic/ngramextractor
```

## Usage

Coming soon.

## Example

```php
$tokenizer = new Tokenizer();
$tokenizer
    ->addRemovalRule('/<\/?\w+[\s\w\=\"\/\#\-\:\.\_]*>/') # Removes HTML tags
    ->addRemovalRule('/[^a-z0-9]+/', ' ') # Replaces every run of characters other than a-z and 0-9 with a space
    ->setSeperator('/\s+/'); # Splits the text into tokens on whitespace
```
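
To make the effect of these three patterns easier to see, here is a minimal sketch that applies them with PHP's built-in PCRE functions instead of the library. `sketchTokenize` is a hypothetical helper, not part of NGramExtractor, and lowercasing the input first is an assumption made here because the second pattern only keeps `a-z` and `0-9`; the library's `Tokenizer` may behave differently.

```php
<?php
// Hypothetical helper for illustration only; not the library's Tokenizer.
function sketchTokenize(string $text): array
{
    // Assumption: lowercase first, since the second pattern only keeps a-z and 0-9.
    $text = strtolower($text);
    // Rule 1: strip HTML tags (no replacement given, so they are simply removed).
    $text = preg_replace('/<\/?\w+[\s\w\=\"\/\#\-\:\.\_]*>/', '', $text);
    // Rule 2: collapse every run of other characters into a single space.
    $text = preg_replace('/[^a-z0-9]+/', ' ', $text);
    // Separator: split on whitespace and drop empty tokens.
    return preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);
}

print_r(sketchTokenize('<p>Hello, <b>World</b>! Hello again.</p>'));
// Tokens: hello, world, hello, again
```

Note that without the lowercasing step, uppercase letters would be treated like punctuation and replaced by spaces.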

```php
$content = ""; # The text that should get tokenized
$stopwords = array(); # (optional) array of stopwords

$extractor = new NGramExtractor($content, $tokenizer, $stopwords);
$unigrams = $extractor->getNGrams(1); # All unigrams (n-grams with n = 1) in the text

# Unigrams with their occurrence counts, limited to those that occur at least 3 times
$unigramsFiltered = NGramExtractor::limitByOccurance($extractor->getNGramCount(1, true), 3);
```
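
As a rough illustration of what extracting, counting and limiting n-grams involves, the following sketch works on a plain token array. The functions `sketchNGrams` and `sketchLimitByCount` are hypothetical names chosen for this example; they are not the library's implementation, and the library's return formats may differ.

```php
<?php
// Hypothetical helpers for illustration only; not part of NGramExtractor.

// Builds all n-grams from a token list, each joined into one string.
function sketchNGrams(array $tokens, int $n): array
{
    if ($n < 1) {
        return [];
    }
    $ngrams = [];
    // Slide a window of $n tokens across the list.
    for ($i = 0; $i + $n <= count($tokens); $i++) {
        $ngrams[] = implode(' ', array_slice($tokens, $i, $n));
    }
    return $ngrams;
}

// Keeps only the entries whose count is at least $min.
function sketchLimitByCount(array $counts, int $min): array
{
    return array_filter($counts, function ($count) use ($min) {
        return $count >= $min;
    });
}

$tokens  = ['the', 'cat', 'saw', 'the', 'cat'];
$bigrams = sketchNGrams($tokens, 2);      // the cat, cat saw, saw the, the cat
$counts  = array_count_values($bigrams);  // n-gram => number of occurrences
print_r(sketchLimitByCount($counts, 2));  // only "the cat" (count 2) remains
```

Stopword handling is left out of this sketch; one simple option would be to remove stopwords from `$tokens` with `array_diff` before building the n-grams.
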

## Resources

* [Download of stopword lists for different languages](http://members.unine.ch/jacques.savoy/clef/index.html)