https://github.com/dscmatter/tf-idf-document_scorer
TF-IDF (Term frequency, Inverse Document Frequency) is an algorithm or way to score the importance of words (or 'terms') based on how frequently they appear
algorithm python tf-idf-score
Last synced: about 1 month ago
- Host: GitHub
- URL: https://github.com/dscmatter/tf-idf-document_scorer
- Owner: DSCmatter
- License: mit
- Created: 2024-01-29T12:00:18.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2024-06-28T09:52:24.000Z (7 months ago)
- Last Synced: 2024-06-28T11:14:18.944Z (7 months ago)
- Topics: algorithm, python, tf-idf-score
- Language: Python
- Homepage:
- Size: 23.4 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# TF-IDF
TF-IDF (Term Frequency, Inverse Document Frequency) is a way to score the importance of words (or 'terms') in a document based on how frequently they appear, both within that document and across a collection of documents. This means:
- If a word appears frequently in a document, it's important. Give the word a high score.
- But if a word appears in many documents, it's not a unique identifier. Give the word a low score.

## Prerequisites:
Before using this TF-IDF implementation, ensure you have the following packages installed:

- textblob
- nltk

You can install these packages using pip:

`pip install textblob nltk`

## Improved TF-IDF Implementation
This implementation of TF-IDF features improvements such as:

- Utilizing NLTK to download stopwords and tokenize the text.
- Filtering out stopwords from the document before calculating TF-IDF scores.
- Lowercasing the words to ensure case insensitivity.
- Calculating TF-IDF scores based on the filtered document.

## Usage
- Ensure you have Python installed on your system.
- Install the required packages using pip as mentioned in the Prerequisites section.
- Clone or download this repository.
- Navigate to the directory containing the TF-IDF script.
- Run the script and follow the prompts to enter the location of the document file.

The script will calculate the TF-IDF scores and display the top words along with their scores.
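To make the scoring concrete, here is a minimal, dependency-free sketch of the steps this README describes: lowercasing, tokenizing, filtering stopwords, and ranking words by TF-IDF. It is not the repository's script. The function names (`tokenize`, `tf`, `idf`, `top_words`), the tiny `STOPWORDS` set (a stand-in for NLTK's stopword list), the sample sentences, and the smoothed IDF variant `log((1 + N) / (1 + df)) + 1` are all assumptions chosen for illustration.

```python
import math
import re

# Tiny illustrative stopword set (assumption): stands in for NLTK's English list.
STOPWORDS = {"a", "an", "and", "the", "is", "in", "it", "of", "on", "to"}


def tokenize(text):
    """Lowercase the text, split it into words, and drop stopwords."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOPWORDS]


def tf(word, doc_words):
    """Term frequency: the share of the document's words that are `word`."""
    return doc_words.count(word) / len(doc_words)


def idf(word, corpus):
    """Smoothed inverse document frequency over all tokenized documents."""
    containing = sum(1 for doc in corpus if word in doc)
    return math.log((1 + len(corpus)) / (1 + containing)) + 1


def top_words(doc_words, corpus, n=3):
    """Return the n highest-scoring (word, score) pairs for one document."""
    scores = {w: tf(w, doc_words) * idf(w, corpus) for w in set(doc_words)}
    # Sort by descending score, then alphabetically so ties are deterministic.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


# Hypothetical sample documents, not the contents of the repository's files.
docs = [
    "The cat sat on the mat and the cat purred.",
    "The dog sat on the log.",
]
corpus = [tokenize(d) for d in docs]

for i, words in enumerate(corpus, start=1):
    print(f"Document {i}:")
    for word, score in top_words(words, corpus):
        print(f"  {word}: {score:.3f}")
```

In this sketch, a word that occurs often in one document but in few others (such as "cat") ranks above a word shared by every document (such as "sat"), which matches the two rules stated at the top of this README.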
## Examples
Two example text files have been provided in the repository for testing the TF-IDF algorithm.

- text.txt
- text2.txt
## Further Reading
- For more information on TF-IDF and its applications, visit the following link:
  - [TF-IDF Explained - Steven Loria](https://stevenloria.com/tf-idf/)
## License
- This project is licensed under the [MIT License](LICENSE) - see the [LICENSE](LICENSE) file for details.