https://github.com/dscmatter/tf-idf-document_scorer

TF-IDF (Term frequency, Inverse Document Frequency) is an algorithm or way to score the importance of words (or 'terms') based on how frequently they appear
algorithm python tf-idf-score

Host: GitHub
URL: https://github.com/dscmatter/tf-idf-document_scorer
Owner: DSCmatter
License: mit
Created: 2024-01-29T12:00:18.000Z (over 2 years ago)
Default Branch: main
Last Pushed: 2024-06-28T09:52:24.000Z (about 2 years ago)
Last Synced: 2025-02-11T09:51:13.253Z (over 1 year ago)
Topics: algorithm, python, tf-idf-score
Language: Python
Homepage:
Size: 23.4 KB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

README

# TF-IDF
TF-IDF (Term Frequency-Inverse Document Frequency) is a way to score the importance of words (or 'terms') in a document based on how frequently they appear in that document and across a collection of documents.

In other words:
- If a word appears frequently in a document, it's important. Give the word a high score.
- But if a word appears in many documents, it's not a unique identifier. Give the word a low score.
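
The two rules above can be sketched in a few lines of Python. This is a minimal illustration of one common TF-IDF formulation, not necessarily the exact formula used by the script in this repository. The function names (`tf`, `n_containing`, `idf`, `tfidf`) and the sample documents are assumptions made for the example.

```python
import math


def tf(word, doc):
    """Term frequency: fraction of the tokens in `doc` that equal `word`."""
    return doc.count(word) / len(doc)


def n_containing(word, docs):
    """Number of documents in `docs` that contain `word`."""
    return sum(1 for doc in docs if word in doc)


def idf(word, docs):
    """Inverse document frequency; the +1 avoids division by zero
    for words that appear in no document (a smoothing assumption)."""
    return math.log(len(docs) / (1 + n_containing(word, docs)))


def tfidf(word, doc, docs):
    """TF-IDF score of `word` in `doc`, relative to the collection `docs`."""
    return tf(word, doc) * idf(word, docs)


# Hypothetical, already-tokenized documents.
docs = [
    ["python", "is", "a", "language"],
    ["the", "python", "is", "a", "snake"],
    ["a", "language", "is", "spoken"],
]

for word in ("snake", "python", "is"):
    print(word, round(tfidf(word, docs[1], docs), 4))
```

With this smoothing, a word that appears in every document (such as "is") receives a negative score, while a word unique to one document (such as "snake") scores highest.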

## Prerequisites
Before using this TF-IDF implementation, ensure you have the following packages installed:

- textblob
- nltk

You can install these packages using pip:
`pip install textblob nltk`

## Improved TF-IDF Implementation
This implementation of TF-IDF features improvements such as:

- Utilizing NLTK to download stopwords and tokenize the text.
- Filtering out stopwords from the document before calculating TF-IDF scores.
- Lowercasing the words to ensure case insensitivity.
- Calculating TF-IDF scores based on the filtered document.
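
The preprocessing steps listed above might look roughly like the following sketch. So that it runs without downloading any NLTK data, it swaps in a regular-expression tokenizer and a tiny hard-coded stopword set. Both are stand-ins (assumptions) for the NLTK tokenizer and stopword list that this implementation uses, and the `preprocess` helper is hypothetical.

```python
import re

# Tiny stand-in stopword set (assumption); the real script relies on
# NLTK's much larger English stopword list.
STOPWORDS = {"a", "an", "and", "in", "is", "it", "of", "the", "to"}


def preprocess(text, stopwords=STOPWORDS):
    """Lowercase the text, split it into word tokens, and drop stopwords."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [token for token in tokens if token not in stopwords]


print(preprocess("The Cat sat in the hat, and the cat is happy."))
# → ['cat', 'sat', 'hat', 'cat', 'happy']
```

Lowercasing before filtering means "The" and "the" are both matched against the stopword set, and "Cat" and "cat" are counted as the same term.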

## Usage
- Ensure you have Python installed on your system.
- Install the required packages using pip as mentioned in the Prerequisites section.
- Clone or download this repository.
- Navigate to the directory containing the TF-IDF script.
- Run the script and follow the prompts to enter the location of the document file.

The script will calculate the TF-IDF scores and display the top words along with their scores.
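
As a rough illustration of that final step, the sketch below ranks the words of one tokenized document by TF-IDF score and prints the top few. The `top_words` helper, the smoothing in the IDF term, and the sample documents are all assumptions. The repository's script may build its document collection and format its output differently.

```python
import math


def top_words(doc, docs, n=3):
    """Return the `n` highest-scoring (word, score) pairs for `doc`.

    `doc` is a list of tokens and `docs` is the collection it belongs to.
    Ties are broken alphabetically so the output is deterministic.
    """
    scores = {}
    for word in set(doc):
        term_freq = doc.count(word) / len(doc)
        containing = sum(1 for d in docs if word in d)
        inv_doc_freq = math.log(len(docs) / (1 + containing))
        scores[word] = term_freq * inv_doc_freq
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))[:n]


# Hypothetical, already-filtered documents.
docs = [
    ["cat", "cat", "sat", "mat"],
    ["dog", "sat", "log"],
    ["bird", "sang", "song"],
]

for word, score in top_words(docs[0], docs, n=2):
    print(f"{word}: {score:.5f}")
```

Here "cat" ranks first because it is frequent in the first document and absent from the others, while "sat" scores zero because it also appears in a second document.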

## Examples
Two example text files have been provided in the repository for testing the TF-IDF algorithm.

- text.txt
- text2.txt

## Further Reading
For more information on TF-IDF and its applications, see:

- [TF-IDF Explained - Steven Loria](https://stevenloria.com/tf-idf/)

## License
This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.
