https://github.com/kanishknavale/irtm-toolbox
This repository holds functions pivotal for IRTM processing. This repository is staged for continuous development.
- Host: GitHub
- URL: https://github.com/kanishknavale/irtm-toolbox
- Owner: KanishkNavale
- License: mit
- Created: 2021-08-28T20:54:09.000Z (almost 4 years ago)
- Default Branch: master
- Last Pushed: 2021-09-06T11:57:17.000Z (over 3 years ago)
- Last Synced: 2025-04-02T02:03:03.002Z (2 months ago)
- Topics: information-retrieval, page-rank, python, soundex, soundex-algorithm, text-mining, tf-idf, token-importance, tokenizer
- Language: Python
- Homepage: https://pypi.org/project/irtm/#description
- Size: 110 KB
- Stars: 1
- Watchers: 1
- Forks: 2
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Information Retrieval & Text Mining Toolbox
This repository provides core functions for information retrieval and text mining (IRTM). It is under continuous development.
## Quick Install using 'pip/pip3' & GitHub
```bash
pip install git+https://github.com/KanishkNavale/IRTM-Toolbox.git
```

## Import Module
```python
from irtm.toolbox import *
```

## Using Functions
1. Soundex: A phonetic algorithm for indexing names by sound, as pronounced in English.
```python
print(soundex('Muller'))
print(soundex('Mueller'))
```

```bash
>>> 'M466'
>>> 'M466'
```

2. Tokenizer: Converts a sequence of characters into a sequence of tokens.
```python
print(tokenize('LINUX'))
print(tokenize('Text Mining 2021'))
```

```bash
>>> ['linux']
>>> ['text', 'mining']
```

3. Vectorize: Converts a string to a token-based weight tensor.
```python
vector = vectorize([
    'texts ([string]): a multiline or a single line string.',
    'dict ([list], optional): list of tokens. Defaults to None.',
    'enable_Idf (bool, optional): use IDF or not. Defaults to True.',
    'normalize (str, optional): normalization of vector. Defaults to l2.',
    'max_dim ([int], optional): dimension of vector. Defaults to None.',
    'smooth (bool, optional): restricts value >0. Defaults to True.',
    'weightedTf (bool, optional): Tf = 1+log(Tf). Defaults to True.',
    'return_features (bool, optional): feature vector. Defaults to False.'
])

print(f'Vector Shape={vector.shape}')
```

```bash
>>> Vector Shape=(8, 37)
```

4. Predict Token Weights: Computes importance of a token based on classification optimization.
```python
import numpy as np

dictionary = ['vector', 'string', 'bool']
vector = vectorize([
    'X ([np.array]): vectorized matrix columns arranged as per the dictionary.',
    'y ([labels]): True classification labels.',
    'epochs ([int]): Optimization epochs.',
    'verbose (bool, optional): Enable verbose outputs. Defaults to False.',
    'dict ([type], optional): list of tokens. Defaults to None.'
], dict=dictionary)

# Note: np.random.randint(1, ...) draws from [0, 1), so every label is 0.
labels = np.random.randint(1, size=(vector.shape[0], 1))
weights = predict_weights(vector, labels, 100, dict=dictionary)
```

```bash
>>> Token-Weights Mappings: {'vector': 0.22097790924850977,
'string': 0.39296369957440075,
'bool': 0.689853175081446}
```

5. Page Rank: Computes page rank from a chain matrix.
```python
import numpy as np

chain_matrix = np.array([[0, 0, 1],
                         [1, 0, 1],
                         [0, 1, 0]])

print(page_rank(chain_matrix))
rank, TPM = page_rank(chain_matrix, return_TransMatrix=True)
print(f'Page Rank: {rank} \nTransition Probability Matrix: \n{TPM}')
```

```bash
>>> [0.0047 0.997 0.0767]
>>> Page Rank: [0.0047 0.997 0.0767]
Transition Probability Matrix:
[[0.03333333 0.03333333 0.93333333]
 [0.48333333 0.03333333 0.48333333]
 [0.03333333 0.93333333 0.03333333]]
```
```
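
The transition probability matrix printed for the Page Rank example is consistent with a common construction: row-normalize the chain matrix, then mix each row with a uniform teleport distribution. The sketch below reproduces those values with plain NumPy. The helper name `transition_matrix` and the `teleport=0.1` parameter are assumptions inferred from the printed numbers, not taken from the toolbox's source, so treat it as an illustration rather than the library's implementation.

```python
import numpy as np


def transition_matrix(chain_matrix, teleport=0.1):
    """Row-normalize a link matrix and mix in uniform teleportation.

    Hypothetical helper: `teleport=0.1` is an assumed value that happens
    to reproduce the matrix shown in the README output.
    """
    links = np.asarray(chain_matrix, dtype=float)
    n = links.shape[0]
    row_sums = links.sum(axis=1, keepdims=True)
    # Avoid division by zero for rows without outgoing links.
    safe_sums = np.where(row_sums == 0, 1.0, row_sums)
    # Rows without outgoing links fall back to a uniform distribution.
    stochastic = np.where(row_sums == 0, 1.0 / n, links / safe_sums)
    # Blend the link-following step with a uniform random jump.
    return (1.0 - teleport) * stochastic + teleport / n


chain_matrix = np.array([[0, 0, 1],
                         [1, 0, 1],
                         [0, 1, 0]])
tpm = transition_matrix(chain_matrix)
print(np.round(tpm, 8))
```

Every row of the result sums to 1, so the matrix is a valid Markov transition matrix. For example, the first row has a single outgoing link, giving `0.9 * 1 + 0.1 / 3 ≈ 0.9333` for that entry and `0.1 / 3 ≈ 0.0333` elsewhere.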