https://github.com/tomijuarez/lemmatisation

Lemmatisation fully implemented in Java.

algorithms data-analysis data-science java-8 lemmatization oop

Last synced: over 1 year ago

Host: GitHub
URL: https://github.com/tomijuarez/lemmatisation
Owner: tomijuarez
Created: 2018-01-30T12:31:45.000Z (over 8 years ago)
Default Branch: master
Last Pushed: 2018-01-30T12:31:50.000Z (over 8 years ago)
Last Synced: 2025-02-14T14:51:37.234Z (over 1 year ago)
Topics: algorithms, data-analysis, data-science, java-8, lemmatization, oop
Language: Java
Size: 5.86 KB
Stars: 0
Watchers: 2
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

README

# Lemmatisation
A lemmatiser implemented entirely in Java that solves the following problem:

*Take this paragraph of text and return an alphabetized list of ALL unique words. A unique word is any form of a word often communicated with essentially the same meaning. For example, fish and fishes could be defined as a unique word by using their stem fish. For each unique word found in this entire paragraph, determine the how many times the word appears in total. Also, provide an analysis of what unique sentence index position or positions the word is found. The following words should not be included in your analysis or result set: "a", "the", "and", "of", "in", "be", "also" and "as". Your final result MUST be displayed in a readable console output in the same format as the JSON sample object shown below.
Sample Output:*

```json
{
  "results": [
    {
      "word": "ALL",
      "total-occurrences": 1,
      "sentence-indexes": [0]
    },
    {
      "word": "alphabetized",
      "total-occurrences": 1,
      "sentence-indexes": [0]
    },
    {
      "word": "analysis",
      "total-occurrences": 2,
      "sentence-indexes": [4, 5]
    },
    ...
  ]
}
```
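The task above can be sketched roughly as follows. This is a minimal illustration and not the repository's code: the class `WordAnalysisSketch`, the `Occurrence` type, and the `stem` helper are hypothetical names, and `stem` is a deliberately naive placeholder that only strips a few plural endings. The sketch also assumes that sentences end with `.`, `!` or `?`, that words are lower-cased before counting, and that `total-occurrences` counts every occurrence of a stem.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

public class WordAnalysisSketch {

    // Stop words listed in the problem statement.
    static final Set<String> STOP_WORDS = new HashSet<>(
            Arrays.asList("a", "the", "and", "of", "in", "be", "also", "as"));

    /** Occurrence data for one stem (hypothetical type, not taken from the repository). */
    static class Occurrence {
        int total = 0;
        final Set<Integer> sentenceIndexes = new TreeSet<>();
    }

    /** Placeholder stemmer: strips a few plural endings only. A real lemmatiser does far more. */
    static String stem(String word) {
        if (word.endsWith("shes") || word.endsWith("ches") || word.endsWith("xes")) {
            return word.substring(0, word.length() - 2);
        }
        if (word.length() > 3 && word.endsWith("s") && !word.endsWith("ss")
                && !word.endsWith("is") && !word.endsWith("us")) {
            return word.substring(0, word.length() - 1);
        }
        return word;
    }

    /** Maps each stem, in alphabetical order, to its count and sentence indexes. */
    static Map<String, Occurrence> analyse(String text) {
        Map<String, Occurrence> results = new TreeMap<>();
        // Assumption: a sentence ends with '.', '!' or '?' followed by whitespace.
        String[] sentences = text.split("(?<=[.!?])\\s+");
        for (int i = 0; i < sentences.length; i++) {
            for (String token : sentences[i].split("[^A-Za-z]+")) {
                String word = token.toLowerCase();
                if (word.isEmpty() || STOP_WORDS.contains(word)) {
                    continue;
                }
                Occurrence occ = results.computeIfAbsent(stem(word), k -> new Occurrence());
                occ.total++;                // every occurrence is counted
                occ.sentenceIndexes.add(i); // the set keeps each sentence index once
            }
        }
        return results;
    }

    /** Renders the results in the shape of the sample output. */
    static String toJson(Map<String, Occurrence> results) {
        StringBuilder sb = new StringBuilder("{\n  \"results\": [\n");
        int n = 0;
        for (Map.Entry<String, Occurrence> e : results.entrySet()) {
            Occurrence occ = e.getValue();
            sb.append("    {\n")
              .append("      \"word\": \"").append(e.getKey()).append("\",\n")
              .append("      \"total-occurrences\": ").append(occ.total).append(",\n")
              .append("      \"sentence-indexes\": ").append(occ.sentenceIndexes).append("\n")
              .append("    }").append(++n < results.size() ? ",\n" : "\n");
        }
        return sb.append("  ]\n}").toString();
    }

    public static void main(String[] args) {
        String text = "Fish swim. The fishes and a fish swim in the sea.";
        System.out.println(toJson(analyse(text)));
    }
}
```

A `TreeMap` keeps the stems alphabetized without a separate sorting step, and a `TreeSet` prints its indexes in ascending order in a form that is already a valid JSON array.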

The following assumptions were made:

1. The structure prints the stems and not the words. For example, *fishes* and *fish* will both be saved as *fish*.
2. Each sentence is listed only once per stem, because the program stores the sentence indexes in a set. If the set were replaced with a plain list, the occurrence object could list the same sentence more than once. For example, if the words *fish* and *fishes* both appear in the first sentence of a text, that sentence is listed only once.
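The difference described in the second assumption can be seen in a small sketch. The class and method names here are hypothetical and only illustrate the behaviour of a set versus a list; they are not taken from the repository.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

public class SentenceIndexDemo {

    /** Collects sentence indexes in a set, so a repeated index is stored once. */
    static Set<Integer> collectInSet(int... sentenceIndexes) {
        Set<Integer> indexes = new TreeSet<>();
        for (int index : sentenceIndexes) {
            indexes.add(index);
        }
        return indexes;
    }

    /** Collects sentence indexes in a list, so a repeated index is stored every time. */
    static List<Integer> collectInList(int... sentenceIndexes) {
        List<Integer> indexes = new ArrayList<>();
        for (int index : sentenceIndexes) {
            indexes.add(index);
        }
        return indexes;
    }

    public static void main(String[] args) {
        // "fish" and "fishes" both occur in sentence 0, and "fish" once more in sentence 2.
        System.out.println(collectInSet(0, 0, 2));  // prints [0, 2]
        System.out.println(collectInList(0, 0, 2)); // prints [0, 0, 2]
    }
}
```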

To run this program you need Maven and Java 1.8 installed. On Linux, the `run.sh` script builds the jar with its dependencies and then runs it. Give the script executable permission first with `chmod +x "run.sh"`. On Windows or OSX, run the commands in that script manually.
