https://github.com/kanishknavale/irtm-toolbox

This repository holds functions pivotal for IRTM processing. This repository is staged for continuous development.

information-retrieval page-rank python soundex soundex-algorithm text-mining tf-idf token-importance tokenizer


Host: GitHub
URL: https://github.com/kanishknavale/irtm-toolbox
Owner: KanishkNavale
License: mit
Created: 2021-08-28T20:54:09.000Z (almost 4 years ago)
Default Branch: master
Last Pushed: 2021-09-06T11:57:17.000Z (over 3 years ago)
Last Synced: 2025-04-02T02:03:03.002Z (2 months ago)
Topics: information-retrieval, page-rank, python, soundex, soundex-algorithm, text-mining, tf-idf, token-importance, tokenizer
Language: Python
Homepage: https://pypi.org/project/irtm/#description
Size: 110 KB
Stars: 1
Watchers: 1
Forks: 2
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE


README

# Information Retrieval & Text Mining Toolbox

This repository provides core functions for Information Retrieval & Text Mining (IRTM) processing. It is under continuous development.

## Quick Install using 'pip/pip3' & GitHub

```bash
pip install git+https://github.com/KanishkNavale/IRTM-Toolbox.git
```

## Import Module

```python
import numpy as np  # the array-based examples below refer to NumPy as `np`
from irtm.toolbox import *
```

## Using Functions

1. Soundex: A phonetic algorithm for indexing names by sound, as pronounced in English.

    ```python
    print(soundex('Muller'))
    print(soundex('Mueller'))
    ```

    ```bash
    >>> 'M466'
    >>> 'M466'
    ```

2. Tokenizer: Converts a sequence of characters into a sequence of tokens.

    ```python
    print(tokenize('LINUX'))
    print(tokenize('Text Mining 2021'))
    ```

    ```bash
    >>> ['linux']
    >>> ['text', 'mining']
    ```

3. Vectorize: Converts one or more strings into a token-based weight matrix, with one row per input string.

    ```python
    vector = vectorize([
            'texts ([string]): a multiline or a single line string.',
            'dict ([list], optional): list of tokens. Defaults to None.',
            'enable_Idf (bool, optional): use IDF or not. Defaults to True.',
            'normalize (str, optional): normalization of vector. Defaults to l2.',
            'max_dim ([int], optional): dimension of vector. Defaults to None.',
            'smooth (bool, optional): restricts value >0. Defaults to True.',
            'weightedTf (bool, optional): Tf = 1+log(Tf). Defaults to True.',
            'return_features (bool, optional): feature vector. Defaults to False.'
            ])
    print(f'Vector Shape={vector.shape}')
    ```

    ```bash
    >>> Vector Shape=(8, 37)
    ```

4. Predict Token Weights: Computes the importance of each dictionary token by optimizing a classifier against the given labels.

    ```python
    dictionary = ['vector', 'string', 'bool']
    vector = vectorize([
            'X ([np.array]): vectorized matrix columns arranged as per the dictionary.',
            'y ([labels]): True classification labels.',
            'epochs ([int]): Optimization epochs.',
            'verbose (bool, optional): Enable verbose outputs. Defaults to False.',
            'dict ([type], optional): list of tokens. Defaults to None.'
            ], dict=dictionary)
    # Placeholder labels: randint(1, ...) only ever returns 0; substitute real class labels.
    labels = np.random.randint(1, size=(vector.shape[0], 1))
    weights = predict_weights(vector, labels, 100, dict=dictionary)
    ```

    ```bash
    >>> Token-Weights Mappings: {'vector': 0.22097790924850977,
                                 'string': 0.39296369957440075,
                                 'bool': 0.689853175081446}
    ```

5. Page Rank: Computes the page rank from a chain matrix.

    ```python
    chain_matrix = np.array([[0, 0, 1],
                             [1, 0, 1],
                             [0, 1, 0]])
    print(page_rank(chain_matrix))

    rank, TPM = page_rank(chain_matrix, return_TransMatrix=True)
    print(f'Page Rank: {rank} \nTransition Probability Matrix: \n{TPM}')
    ```

    ```bash
    >>> [0.0047 0.997  0.0767]
    >>> Page Rank: [0.0047 0.997  0.0767]
        Transition Probability Matrix:
        [[0.03333333 0.03333333 0.93333333]
         [0.48333333 0.03333333 0.48333333]
         [0.03333333 0.93333333 0.03333333]]
    ```
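
For readers unfamiliar with Soundex, here is a minimal sketch of the simplified textbook variant: keep the first letter, map the remaining letters to digits, collapse repeated digits, drop the vowel placeholders, and pad to four characters. The name `soundex_sketch` is hypothetical and this is not the toolbox's implementation; `soundex` in `irtm.toolbox` may apply different rules, so its codes need not match the ones produced here.

```python
def soundex_sketch(word: str) -> str:
    """Simplified textbook Soundex (an illustration, not the toolbox's code)."""
    # Consonant groups share a digit; every other letter is treated as '0'.
    groups = (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
              ("l", "4"), ("mn", "5"), ("r", "6"))
    codes = {ch: digit for letters, digit in groups for ch in letters}

    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""

    digits = [codes.get(c, "0") for c in letters]

    # Collapse runs of identical digits into a single digit.
    collapsed = [digits[0]]
    for d in digits[1:]:
        if d != collapsed[-1]:
            collapsed.append(d)

    # Drop the first letter's own digit and all '0' placeholders.
    tail = [d for d in collapsed[1:] if d != "0"]

    # First letter plus up to three digits, padded with zeros.
    return (letters[0].upper() + "".join(tail) + "000")[:4]
```

Under these rules, `soundex_sketch('Robert')` and `soundex_sketch('Rupert')` both yield `'R163'`.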
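
The two `tokenize` outputs shown in the list above are consistent with lowercasing the input and keeping only alphabetic runs. A minimal sketch under that assumption follows; `tokenize_sketch` is a hypothetical name, and the toolbox's `tokenize` may do more than this.

```python
import re


def tokenize_sketch(text: str) -> list:
    """Lowercase the text and return its runs of ASCII letters as tokens."""
    # Digits, punctuation, and whitespace act as separators and are discarded.
    return re.findall(r"[a-z]+", text.lower())


print(tokenize_sketch("LINUX"))             # ['linux']
print(tokenize_sketch("Text Mining 2021"))  # ['text', 'mining']
```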
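
The `vectorize` options described in the example strings (IDF weighting, `Tf = 1+log(Tf)`, l2 normalization) correspond to a standard TF-IDF pipeline. The sketch below shows one way those pieces fit together. The function name `tfidf_sketch`, the regex tokenizer, and the smoothed IDF formula `log((1 + N) / (1 + df)) + 1` are assumptions made for illustration; the toolbox may define them differently, so the resulting shapes and values need not match its output.

```python
import re

import numpy as np


def tfidf_sketch(texts, vocab=None):
    """Return an l2-normalized TF-IDF matrix (rows = texts) and its vocabulary."""
    docs = [re.findall(r"[a-z]+", t.lower()) for t in texts]
    if vocab is None:
        vocab = sorted({tok for doc in docs for tok in doc})
    index = {tok: j for j, tok in enumerate(vocab)}

    # Raw term counts; tokens outside the vocabulary are ignored.
    tf = np.zeros((len(docs), len(vocab)))
    for i, doc in enumerate(docs):
        for tok in doc:
            if tok in index:
                tf[i, index[tok]] += 1

    # Weighted term frequency: 1 + log(tf) for non-zero counts.
    present = tf > 0
    tf[present] = 1.0 + np.log(tf[present])

    # Smoothed inverse document frequency (assumed formula, always > 0).
    df = present.sum(axis=0)
    idf = np.log((1 + len(docs)) / (1 + df)) + 1.0

    weights = tf * idf

    # l2-normalize each row, leaving all-zero rows untouched.
    norms = np.linalg.norm(weights, axis=1, keepdims=True)
    norms[norms == 0] = 1.0
    return weights / norms, vocab
```

Calling `tfidf_sketch(['text mining', 'text retrieval retrieval'])` returns a matrix of shape `(2, 3)` over the vocabulary `['mining', 'retrieval', 'text']`.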
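
The README does not state which model or optimizer `predict_weights` uses. One plausible reading of "classification optimization" is to fit a linear classifier and read off its per-token coefficients. The sketch below does this with logistic regression trained by plain gradient descent; this technique is substituted purely for illustration, and the name `logistic_token_weights` and the learning rate `lr` are hypothetical.

```python
import numpy as np


def logistic_token_weights(X, y, epochs=100, lr=0.5):
    """Fit logistic regression by gradient descent; return one weight per column."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float).reshape(-1)  # accepts (n,) or (n, 1) labels
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted probabilities
        err = p - y                             # gradient of log-loss w.r.t. logits
        w -= lr * (X.T @ err) / len(y)
        b -= lr * err.mean()
    return w


# Toy data: column 0 marks class 1, column 1 marks class 0.
X = [[1, 0], [1, 0], [0, 1], [0, 1]]
y = [1, 1, 0, 0]
weights = logistic_token_weights(X, y)
print(dict(zip(["token_a", "token_b"], weights)))
```

In this toy setup the first token receives a positive weight and the second a negative one, reflecting which class each token indicates.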
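
The transition probability matrix printed in the Page Rank example is what results from row-normalizing the chain matrix and mixing in a uniform teleport term with probability 0.1 (each zero entry becomes 0.1 / 3 ≈ 0.0333). The sketch below rebuilds that matrix and then finds the stationary distribution by power iteration. The name `page_rank_sketch`, the handling of rows without out-links, and the choice to normalize the rank so it sums to 1 are assumptions, so the rank values are not expected to match the toolbox's printed vector.

```python
import numpy as np


def page_rank_sketch(chain_matrix, teleport=0.1, tol=1e-10, max_iter=1000):
    """Return (rank, transition_matrix) using power iteration with teleporting."""
    A = np.asarray(chain_matrix, dtype=float)
    n = A.shape[0]

    # Row-normalize; rows with no out-links jump uniformly (assumed behaviour).
    row_sums = A.sum(axis=1, keepdims=True)
    safe_sums = np.where(row_sums > 0, row_sums, 1.0)
    P = np.where(row_sums > 0, A / safe_sums, 1.0 / n)

    # Mix in the uniform teleport probability.
    P = (1.0 - teleport) * P + teleport / n

    # Power iteration on the left eigenvector: rank = rank @ P.
    rank = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        new_rank = rank @ P
        converged = np.abs(new_rank - rank).sum() < tol
        rank = new_rank
        if converged:
            break
    return rank, P


rank, P = page_rank_sketch([[0, 0, 1],
                            [1, 0, 1],
                            [0, 1, 0]])
print(P)
print(rank)
```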
