https://github.com/marios-mamalis/asolut

A solution for the synonym problem in word frequency algorithms

gui-application synonyms word-frequency


Host: GitHub
URL: https://github.com/marios-mamalis/asolut
Owner: Marios-Mamalis
License: mit
Created: 2020-04-16T14:15:33.000Z (over 5 years ago)
Default Branch: master
Last Pushed: 2024-07-25T20:21:32.000Z (over 1 year ago)
Last Synced: 2025-04-18T08:54:35.690Z (6 months ago)
Topics: gui-application, synonyms, word-frequency
Language: Python
Homepage:
Size: 211 KB
Stars: 2
Watchers: 2
Forks: 1
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

# asolut

A solution for the synonym problem in word frequency algorithms

![Downloads](https://img.shields.io/pepy/dt/asolut?label=Downloads) ![PyPI - License](https://img.shields.io/pypi/l/asolut?color=red) ![PyPI](https://img.shields.io/pypi/v/asolut?label=version)

This library contains the official implementation of the synonym-augmented frequency algorithm presented in "[A solution for the synonym problem in word frequency algorithms](https://doi.org/10.13140/RG.2.2.14789.27369)", along with a GUI wrapper and text preprocessing utilities.

## Installation

The package requires Python 3.7.3 and can be installed from PyPI with the following command:

```commandline
pip install asolut
```

Additionally, the NLTK `stopwords`, `averaged_perceptron_tagger` and `wordnet` resources are needed.
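One way to fetch these resources is NLTK's own downloader. The sketch below wraps `nltk.download`; the helper name `download_nltk_resources` and the constant `REQUIRED_RESOURCES` are illustrative and not part of asolut.

```python
# Resource names as listed above.
REQUIRED_RESOURCES = ("stopwords", "averaged_perceptron_tagger", "wordnet")


def download_nltk_resources(resources=REQUIRED_RESOURCES):
    """Fetch each resource via nltk.download and report per-resource success."""
    # Hypothetical helper, not part of asolut. NLTK is imported lazily so the
    # function can be defined before NLTK itself is installed.
    import nltk

    return {name: nltk.download(name) for name in resources}
```

Calling `download_nltk_resources()` once per environment should normally be enough, since NLTK stores downloaded data locally.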

## Reference

### asolut.preprocessing

```python
asolut.preprocessing(texts, pos=None, chrsplt="\s|\\\\|/",
                     keepstopwords=False, mode="normal", chng=True)
```

Performs basic text preprocessing on a given string. Preprocessing includes tokenization, part-of-speech (PoS) filtering, stopword removal, special character handling and lemmatization.

#### Parameters:

- `texts: str`  

The text to be preprocessed. Can be any valid string.

- `pos: [str, ...]`, default=`None`  

The parts of speech that should be included in the output. Words whose PoS tag is not in the list are discarded. List items must be valid [Penn Treebank PoS tags](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html).

When `pos` is `None`, the function falls back to the following list (adjectives, adverbs, verbs and nouns):

`["JJ", "JJR", "JJS", "NN", "NNS", "NNP", "NNPS", "RB", "RBR", "RBS", "VB", "VBD", "VBG", "VBN", "VBP", "VBZ"]`

- `chrsplt: str`, default=`"\s|\\\\|/"`  

Regular expression pattern that defines the characters on which the text is split. Must be a valid RegEx pattern.

- `keepstopwords: bool`, default=`False`  

Specifies whether stop words should be kept (`True`) or discarded (`False`).

- `mode: {"none", "normal", "extended", "full", "custom (custom pattern)"}`, default=`"normal"`  

Defines which special characters contained in words should be removed.

Can be `"none"`, `"normal"`, `"extended"`, `"full"`, or the prefix `"custom "` followed by any valid RegEx pattern (e.g. `"custom a|b"`).

The predefined RegEx patterns are as follows:

  - `"none"`: no RegEx pattern (keeps words unchanged)

  - `"normal"`: `^\W+|\W+$`

  - `"extended"`: `^[^\w°؋฿₿¢₡₵$₫֏€ƒ₲₾₴₭₺₼₥₦₱£﷼៛ރ₽₨௹₹৲૱₪₸₮₩¥₳₠₢₯₣₤₶ℳ₧₰₷©™®]+|[^\w°؋฿₿¢₡₵$₫֏€ƒ₲₾₴₭₺₼₥₦₱£﷼៛ރ₽₨௹₹৲૱₪₸₮₩¥₳₠₢₯₣₤₶ℳ₧₰₷©™®]+$`

  - `"full"`: `\W`

- `chng: bool`, default=`True`  

Specifies whether words should be lemmatized (`True`) or not (`False`).
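To make the `chrsplt` and `mode` patterns concrete, here is a minimal sketch that uses only Python's `re` module. It is not the library's code: the function name `split_and_clean`, the split-then-strip order and the dropping of empty tokens are assumptions made for illustration, and the `"extended"` pattern is omitted for brevity.

```python
import re

# Patterns copied from the parameter descriptions above.
# The raw string below is the same regex as the documented default "\s|\\\\|/".
DEFAULT_CHRSPLT = r"\s|\\|/"  # whitespace, backslash or forward slash
MODE_PATTERNS = {
    "normal": r"^\W+|\W+$",  # non-word characters at either end of a token
    "full": r"\W",           # every non-word character
}


def split_and_clean(text, chrsplt=DEFAULT_CHRSPLT, mode="normal"):
    """Split text on chrsplt, then remove special characters according to mode."""
    if mode == "none":
        pattern = None  # keep tokens unchanged
    elif mode.startswith("custom "):
        pattern = mode[len("custom "):]  # user-supplied pattern after the prefix
    else:
        pattern = MODE_PATTERNS[mode]
    tokens = re.split(chrsplt, text)
    if pattern is not None:
        tokens = [re.sub(pattern, "", token) for token in tokens]
    # Assumption: tokens left empty after splitting or stripping are dropped.
    return [token for token in tokens if token]


sample = '"co-op" and/or (well-known)!'
print(split_and_clean(sample, mode="normal"))  # ['co-op', 'and', 'or', 'well-known']
print(split_and_clean(sample, mode="full"))    # ['coop', 'and', 'or', 'wellknown']
```

Note how `"normal"` keeps the inner hyphen of `co-op` while `"full"` removes it, which is the practical difference between the two patterns.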

#### Returns:

- `textlist: [str, ...]`    

The preprocessed text as a list of tokens.

### asolut.freqs

```python
asolut.freqs(textlist, sortedby="sum", returntype="plot", figtitle="plot", numb=None)
```

Calculates the frequencies of words by taking into account their synonyms.

#### Parameters:

- `textlist: [str, ...]`  

A list of tokens, preferably word-level tokens.

- `sortedby: {"frequencies", "synonym frequencies", "sum"}`, default=`"sum"`  

Specifies which frequency the output is sorted by, in descending order.

  - `"frequencies"`: Standard word frequencies.

  - `"synonym frequencies"`: Frequencies of each word's synonyms only.

  - `"sum"`: The sum of both synonym and word frequencies.

- `returntype: {"plot", "data", "both"}`, default=`"plot"`  

Specifies the output of the function.

  - `"plot"`: Creates and saves an interactive HTML horizontal stacked bar chart. Returns `None`.

  - `"data"`: Returns the resulting information as a `pandas.DataFrame` object.

  - `"both"`: Creates and saves the interactive HTML bar chart and returns the information as a `pandas.DataFrame` object.

If a plot is generated, it is the horizontal stacked bar chart described above.

- `figtitle: str`, default=`"plot"`  

If a plot is created, this parameter specifies the filename under which it is saved.

- `numb: int`, default=`None`  

Specifies the number of bars shown in the bar chart. The effective value of `numb` is determined as follows:

$$
numb =
\begin{cases}
\min(15, n\_unique, numb\_input) & \text{if } numb\_input \gt 0 \\
\min(15, n\_unique) & \text{if } numb\_input \le 0
\end{cases}
$$

where `n_unique` is the number of unique words after preprocessing and `numb_input` is the user input for the `numb` parameter. The input should be a positive integer; for non-positive values the second case applies.
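The piecewise rule for `numb` can be written as a small helper. This is a sketch of the rule as documented, not the library's code: the name `effective_numb` and the `cap` argument are hypothetical, and treating `None` (the documented default) like a non-positive input is an assumption, since the rule above does not cover that case explicitly.

```python
def effective_numb(n_unique, numb_input=None, cap=15):
    """Number of bars to draw, following the piecewise rule above."""
    # Assumption: None is handled like a non-positive input.
    if numb_input is not None and numb_input > 0:
        return min(cap, n_unique, numb_input)
    return min(cap, n_unique)


print(effective_numb(40, 5))    # 5: the user's request is the smallest bound
print(effective_numb(40, 100))  # 15: capped at 15 bars
print(effective_numb(8, 10))    # 8: only 8 unique words are available
```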

#### Returns: 

- `data: pandas.DataFrame or None`  

The DataFrame containing the calculated counts, or `None` when `returntype="plot"`. The DataFrame has the following format:

| Words     | Counts | Synonym Counts | List of synonyms      |
|-----------|--------|----------------|-----------------------|
| headphone | 1      | 3              | [earphone, earpiece]  |
| flower    | 1      | 0              | []                    |
| earphone  | 2      | 2              | [earpiece, headphone] |
| earpiece  | 1      | 3              | [earphone, headphone] |
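The example table suggests that a word's synonym count is the summed count of its synonyms that occur in the text. The sketch below reproduces the table under that reading. It is not the library's implementation: `SYNONYMS` is a hand-written, hypothetical synonym map standing in for whatever lexical resource asolut consults, plain tuples stand in for the DataFrame rows, and the function names are illustrative.

```python
from collections import Counter

# Hypothetical hand-written synonym map, not part of asolut.
SYNONYMS = {
    "headphone": {"earphone", "earpiece"},
    "earphone": {"earpiece", "headphone"},
    "earpiece": {"earphone", "headphone"},
}


def synonym_augmented_counts(tokens, synonyms):
    """Return one row per unique token: (word, count, synonym count, synonyms)."""
    counts = Counter(tokens)
    rows = []
    for word, count in counts.items():  # Counter keeps first-seen order
        # Assumption: only synonyms that occur in the text are listed and counted.
        present = sorted(s for s in synonyms.get(word, ()) if s in counts)
        synonym_count = sum(counts[s] for s in present)
        rows.append((word, count, synonym_count, present))
    return rows


def sort_rows(rows, sortedby="sum"):
    """Order rows in descending order by the chosen frequency type."""
    keys = {
        "frequencies": lambda r: r[1],
        "synonym frequencies": lambda r: r[2],
        "sum": lambda r: r[1] + r[2],
    }
    return sorted(rows, key=keys[sortedby], reverse=True)


tokens = ["headphone", "flower", "earphone", "earphone", "earpiece"]
rows = synonym_augmented_counts(tokens, SYNONYMS)
for row in rows:
    print(row)
```

With these five tokens the rows match the table: `headphone` occurs once while its synonyms occur three times in total (`earphone` twice, `earpiece` once), and `flower` has no synonyms in the text.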

### asolut.gui

```python
asolut.gui()
```

Displays a graphical user interface that wraps the functions above, making the tool accessible to non-developers. The GUI can only generate the horizontal stacked bar chart.
