https://github.com/davityak03/nlp-bag-of-words-and-tf-idf

bag-of-words count-vectorizer nlp nltk porterstemmer python python-regex stopwords tf-idf-vectorizer wordnetlemmatizer


Host: GitHub
URL: https://github.com/davityak03/nlp-bag-of-words-and-tf-idf
Owner: Davityak03
Created: 2024-07-24T10:10:29.000Z (over 1 year ago)
Default Branch: main
Last Pushed: 2024-07-24T10:16:41.000Z (over 1 year ago)
Last Synced: 2025-01-18T12:21:07.573Z (11 months ago)
Topics: bag-of-words, count-vectorizer, nlp, nltk, porterstemmer, python, python-regex, stopwords, tf-idf-vectorizer, wordnetlemmatizer
Language: Jupyter Notebook
Homepage:
Size: 26.4 KB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md


# NLP-Bag-of-Words-&-TF-IDF

## Overview

This project demonstrates two common Natural Language Processing (NLP) techniques for representing text numerically: Bag of Words (BoW) and Term Frequency-Inverse Document Frequency (TF-IDF). Both convert text documents into numerical feature vectors that machine learning models and text-analysis tools can work with.

## Key Concepts

### Bag of Words (BoW)

The Bag of Words model is a simple and widely used method for text representation. It transforms text into a fixed-length vector of word counts, ignoring the order and grammar of words. Here’s a brief overview:

- **Vocabulary Creation**: Build a vocabulary of all unique words in the entire corpus.

- **Vector Representation**: For each document, create a vector where each position represents a word from the vocabulary. The value at each position is the count of the word in the document.
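
The two steps above can be sketched in plain Python. This is a minimal illustration rather than the notebook's code: the `tokenize`, `build_vocabulary`, and `bag_of_words` helpers and the two-sentence corpus are hypothetical, and the tokenizer simply lowercases the text and keeps alphabetic runs.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_vocabulary(corpus):
    """Return a sorted list of every unique token in the corpus."""
    return sorted({token for doc in corpus for token in tokenize(doc)})


def bag_of_words(doc, vocabulary):
    """Count each vocabulary word in doc, in vocabulary order."""
    counts = Counter(tokenize(doc))
    # Counter returns 0 for words that do not occur in this document.
    return [counts[word] for word in vocabulary]


# Hypothetical toy corpus, used only for illustration.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

vocabulary = build_vocabulary(corpus)
vectors = [bag_of_words(doc, vocabulary) for doc in corpus]

print(vocabulary)  # ['cat', 'dog', 'log', 'mat', 'on', 'sat', 'the']
for vector in vectors:
    print(vector)  # [1, 0, 0, 1, 1, 1, 2] then [0, 1, 1, 0, 1, 1, 2]
```

Every vector has the same length as the vocabulary, which is why the feature space grows with the number of distinct words in the corpus.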

**Advantages**:

- Simple and easy to implement.

- Effective for basic text classification tasks.

**Disadvantages**:

- Ignores word order and context.

- Can lead to large feature vectors with high dimensionality.

### Term Frequency-Inverse Document Frequency (TF-IDF)

TF-IDF is a weighting scheme that combines term frequency (TF) and inverse document frequency (IDF) to capture how important a word is to a document relative to the rest of the corpus. Here’s how it works:

- **Term Frequency (TF)**: Measures how frequently a term appears in a document. It is calculated as:

  \[
  \text{TF}(t, d) = \frac{\text{Number of times term } t \text{ appears in document } d}{\text{Total number of terms in document } d}
  \]

- **Inverse Document Frequency (IDF)**: Measures how rare, and therefore how informative, a term is across all documents. A term that appears in every document gets an IDF of zero. It is calculated as:

  \[
  \text{IDF}(t, D) = \log \left( \frac{\text{Total number of documents } N}{\text{Number of documents containing term } t} \right)
  \]

- **TF-IDF**: Combines TF and IDF to provide a score for each term in each document:

  \[
  \text{TF-IDF}(t, d, D) = \text{TF}(t, d) \times \text{IDF}(t, D)
  \]
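
The three formulas above translate almost directly into Python. The sketch below is illustrative only: the function names and the toy corpus are hypothetical, the logarithm is assumed to be natural (the formulas do not fix a base), and returning `0.0` for empty documents or unseen terms is a choice made here to avoid division by zero.

```python
import math
import re


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def term_frequency(term, tokens):
    """TF(t, d): occurrences of term divided by the document's token count."""
    if not tokens:
        return 0.0  # assumption: an empty document contributes no weight
    return tokens.count(term) / len(tokens)


def inverse_document_frequency(term, tokenized_corpus):
    """IDF(t, D): log(N / number of documents containing term)."""
    containing = sum(1 for tokens in tokenized_corpus if term in tokens)
    if containing == 0:
        return 0.0  # assumption: terms absent from the corpus score zero
    return math.log(len(tokenized_corpus) / containing)  # natural log assumed


def tf_idf(term, tokens, tokenized_corpus):
    """TF-IDF(t, d, D) = TF(t, d) * IDF(t, D)."""
    return term_frequency(term, tokens) * inverse_document_frequency(
        term, tokenized_corpus
    )


# Hypothetical toy corpus, used only for illustration.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
]
docs = [tokenize(doc) for doc in corpus]

score_the = tf_idf("the", docs[0], docs)  # "the" is in every document
score_cat = tf_idf("cat", docs[0], docs)  # "cat" is only in the first

print(score_the)  # 0.0
print(round(score_cat, 4))  # 0.1155
```

Here "the" scores zero because it appears in both documents, while "cat" scores (1/6) × ln 2 because it appears in only one. Library implementations often differ in detail: scikit-learn's `TfidfVectorizer`, for example, applies a smoothed IDF and L2 normalization by default, so its values will not match this textbook form exactly.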

**Advantages**:

- Considers both term frequency and the rarity of terms.

- Provides a better representation of important words in documents.

**Disadvantages**:

- More complex than BoW.

- Still ignores word order and context.

## Project Structure

- **`notebooks/`**: Jupyter notebooks demonstrating the usage of BoW and TF-IDF with examples.

- **`README.md`**: This file.

## Requirements

- Python 3.x

- `nltk`

- `scikit-learn`

- `pandas`
