https://github.com/stcarrez/ada-stemmer

Multi natural language stemmer with Snowball generator

Host: GitHub
URL: https://github.com/stcarrez/ada-stemmer
Owner: stcarrez
License: apache-2.0
Created: 2020-05-09T13:45:21.000Z (about 5 years ago)
Default Branch: master
Last Pushed: 2024-09-29T07:11:00.000Z (9 months ago)
Last Synced: 2025-01-15T07:09:31.118Z (6 months ago)
Topics: ada, stemmer
Language: Ada
Homepage:
Size: 33.7 MB
Stars: 4
Watchers: 5
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE.txt


# Ada Stemmer Library

[![Build Status](https://img.shields.io/endpoint?url=https://porion.vacs.fr/porion/api/v1/projects/ada-stemmer/badges/build.json)](https://porion.vacs.fr/porion/projects/view/ada-stemmer/summary)

[![Test Status](https://img.shields.io/endpoint?url=https://porion.vacs.fr/porion/api/v1/projects/ada-stemmer/badges/tests.json)](https://porion.vacs.fr/porion/projects/view/ada-stemmer/xunits)

[![Coverage](https://img.shields.io/endpoint?url=https://porion.vacs.fr/porion/api/v1/projects/ada-stemmer/badges/coverage.json)](https://porion.vacs.fr/porion/projects/view/ada-stemmer/summary)

[![License](https://img.shields.io/badge/license-APACHE2-blue.svg)](LICENSE)

The Ada Stemmer Library is a [stemming](https://en.wikipedia.org/wiki/Stemming) processor for several
natural languages.  It is based on the [Snowball compiler and stemming algorithms](https://snowballstem.org/);
the compiler has been adapted to generate Ada code ([Snowball Ada](https://github.com/stcarrez/snowball/tree/ada-support)).

A stemming algorithm is used in natural language analysis to find
the base or root form of a word.  Such an algorithm is specific to each natural language.
For example, the [Porter Stemmer](https://tartarus.org/martin/PorterStemmer/) algorithm is specific to the English language
and will not work for French, Greek or Russian.

The Ada Stemmer Library integrates stemming algorithms for:
Arabic, Basque, Catalan, Danish, Dutch, English, Finnish, French, German, Greek,
Hindi, Hungarian, Indonesian, Irish, Italian, Lithuanian, Nepali, Norwegian,
Portuguese, Romanian, Russian, Serbian, Spanish, Swedish, Tamil, Turkish.

Example of stemming:

|Language | Word          | Stem       |
|---------|---------------|------------|
|French   | chienne       | chien      |
|French   | affectionnait | affection  |
|English  | zealously     | zealous    |
|English  | transitional  | transit    |
|Greek    | ποσοτητα      | ποσοτητ    |
|Greek    | μνημειωδεσ    | μνημειωδ   |
|Russian  | ячменный      | ячмен      |
|Russian  | адом          | ад         |

## Version 1.2.1 - Under development

* Update to build with Alire

## Version 1.2.0 - May 2022

* Update to use Snowball 2.2 (the Ada code generator has been integrated in Snowball 2.2!)

* Improvement to help in running the tests

## Version 1.1.0 - Oct 2020

* Add support for Arabic, Basque, Catalan, Finnish, Hindi, Hungarian, Indonesian,
  Irish, Lithuanian, Nepali, Norwegian, Porter, Portuguese, Romanian,
  Tamil, Turkish

## Version 1.0.0 - May 2020

* First implementation of the Ada Stemmer Library

# Build

Build with the following commands:

```sh
make
```

## Unit test

To build the unit tests, you will need the [Ada Utility Library](https://github.com/stcarrez/ada-util).
The `make test` target will clone the git repository locally and configure the GNAT project
accordingly to use it and build the unit tests.

```sh
make build test HAVE_ADA_UTIL=yes ADA_PROJECT_PATH=./ada-util/.alire:./ada-util:./ada-util/.alire/unit
```

The unit tests are executed with:

```sh
make test
```

The unit tests contain several reference files in `regtests/files` that come from the
[Lucene](https://lucene.apache.org) search engine unit tests.

# Examples

The samples can be built using:

```sh
gnatmake -Psamples
```

You will get two programs:

* `bin/stemargs` prints the stem of each word given as a program argument,
* `bin/stemwords` reads a file, stems its words and prints the result.

The first argument is the language.  For example:

```sh
bin/stemargs french chienne
```

or:

```sh
bin/stemwords english LICENSE.txt
```

# Simple example

The Ada Stemmer library does not split words.  You have to give it one word at a time
to stem and it returns either the word itself or its stem.  The `Stemmer.Factory` package is
the multi-language entry point.  The stemmer algorithm is created for each call.

```ada
with Ada.Text_IO;
with Stemmer.Factory; use Stemmer.Factory;

  --  Stem one French word and print the result.
  Ada.Text_IO.Put_Line (Stem (L_FRENCH, "chienne"));
```
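The fragment above omits the enclosing subprogram. A minimal sketch of a complete program could look like the following. It assumes that `Stem` and the `L_*` language literals (including `L_ENGLISH`) are declared in `Stemmer.Factory`, as the fragment suggests; the procedure name `Stem_Demo` is hypothetical and not part of the library.

```ada
with Ada.Text_IO;
with Stemmer.Factory;

--  Hypothetical demo procedure: the name Stem_Demo is not part of the library.
procedure Stem_Demo is
   --  Assumption: Stem and the L_* language literals are declared
   --  in Stemmer.Factory, so a use clause makes them directly visible.
   use Stemmer.Factory;
begin
   --  Each call takes a language and a single word and returns a String.
   Ada.Text_IO.Put_Line (Stem (L_FRENCH, "chienne"));
   Ada.Text_IO.Put_Line (Stem (L_ENGLISH, "zealously"));
end Stem_Demo;
```

According to the table of examples in this README, the stems for these two words should be `chien` and `zealous`.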

It is possible to instantiate a specific stemmer algorithm and then use it to stem
words.

```ada
with Ada.Text_IO;
with Stemmer.English;

  Ctx    : Stemmer.English.Context_Type;
  Result : Boolean;

  --  Stem the word, then print the stem if the operation succeeded.
  Ctx.Stem_Word ("zealously", Result);
  if Result then
     Ada.Text_IO.Put_Line (Ctx.Get_Result);
  end if;
```
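As a sketch of how the language-specific fragment above might be used in a complete program, the example below declares one English context and stems two words with it. It uses only the names shown in the fragment (`Context_Type`, `Stem_Word`, `Get_Result`) and assumes that a context can be reused for successive words; the names `Stem_English_Demo` and `Show` are hypothetical.

```ada
with Ada.Text_IO;
with Stemmer.English;

--  Hypothetical demo procedure, not part of the library.
procedure Stem_English_Demo is
   Ctx    : Stemmer.English.Context_Type;
   Result : Boolean;

   --  Stem one word with the shared context and print what was obtained.
   --  Assumption: the same context may be reused for several words.
   procedure Show (Word : in String) is
   begin
      Ctx.Stem_Word (Word, Result);
      if Result then
         Ada.Text_IO.Put_Line (Word & " -> " & Ctx.Get_Result);
      else
         --  Result is False: no stem is read from the context here.
         Ada.Text_IO.Put_Line (Word & " (Result is False)");
      end if;
   end Show;
begin
   Show ("zealously");
   Show ("transitional");
end Stem_English_Demo;
```

Compared with `Stemmer.Factory`, this form names the language at compile time, so only the English stemmer needs to be referenced by the program.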

# References

* [The Porter Stemming Algorithm](https://tartarus.org/martin/PorterStemmer/)

* [Snowball Manual](https://snowballstem.org/compiler/snowman.html)

* [Lucene text analysis](https://lucene.apache.org/core/8_5_1/core/org/apache/lucene/analysis/package-summary.html#package.description)
