https://github.com/jeffrine/inverted-index-search-engine

A Document Search Engine with TF-IDF.

Host: GitHub
URL: https://github.com/jeffrine/inverted-index-search-engine
Owner: JeffrinE
Created: 2024-07-16T03:16:11.000Z
Default Branch: master
Last Pushed: 2024-07-16T05:36:04.000Z
Last Synced: 2025-06-08T10:02:02.648Z
Topics: python, semantic-search, text-search, tfidf-vectorizer
Language: Python
Homepage:
Size: 146 KB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md


A Document Search Engine project with TF-IDF.

# Prerequisites

---

- Python 3.5+

- pip3

- NLTK

- Scikit-learn

# 1. Data Collection

---

Here, we use a custom dataset scraped from [No Starch Press](https://nostarch.com).

The dataset contains the books that No Starch Press lists under the [Programming](https://nostarch.com/catalog/programming) tag.

## 1.1 Data Cleaning

---

In this step, we clean the scraped data by replacing punctuation, digits and other unnecessary characters with spaces. Single-word entries are kept unchanged.

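```python
# Punctuation, whitespace and digits that are replaced with a space.
special_chars = '''!()-[]{};:'"\\, <>./?@#$%^&*_~0123456789+='''

# pub_name is assumed to hold the scraped titles (its definition is not shown here).
pub_list_special_rm = []

for file in pub_name:
    word_sc_rm = ""
    if len(file.split()) == 1:
        # Single-word titles are kept unchanged.
        pub_list_special_rm.append(file)
    else:
        for a in file:
            if a in special_chars:
                word_sc_rm += ' '
            else:
                word_sc_rm += a
        pub_list_special_rm.append(word_sc_rm)
```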
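To see what the cleaning step does to a single title, the same rule can be sketched as a small function. `clean_title` is a hypothetical helper that is not part of the repository, and the sample title is only illustrative. The sketch uses `str.translate` instead of a character loop, which should behave the same way under these assumptions.

```python
# Sketch only: clean_title is a hypothetical helper, not code from the repository.
SPECIAL_CHARS = '''!()-[]{};:'"\\, <>./?@#$%^&*_~0123456789+='''

# Translation table that maps every special character to a single space.
_TABLE = str.maketrans({ch: ' ' for ch in SPECIAL_CHARS})


def clean_title(title):
    """Replace special characters and digits with spaces.

    Single-word titles are returned unchanged, mirroring the loop above.
    """
    if len(title.split()) == 1:
        return title
    return title.translate(_TABLE)


cleaned = clean_title("Python Crash Course, 3rd Edition")
print(cleaned.split())
```

Note that digits are removed as well, so an ordinal such as "3rd" leaves the fragment "rd" behind.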

## 1.2 Data Pre-processing

---

In this step, the cleaned data is pre-processed before creating the inverted index of tokens.

The pre-processing pipeline includes tokenizing each sentence, removing stop words and finally stemming. 

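```python
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

# Assumed setup (not shown in the original snippet): NLTK's English stop words
# and the Porter stemmer. The NLTK tokenizer and stop word data must be downloaded.
STOPWORDS = set(stopwords.words('english'))
stemmer = PorterStemmer()
pub_list_stemmed = []

for name in pub_list_special_rm:
    words = word_tokenize(name)
    stem_word = ""
    for a in words:
        # Skip stop words, stem everything else.
        if a.lower() not in STOPWORDS:
            stem_word += stemmer.stem(a) + ' '
    pub_list_stemmed.append(stem_word.lower())
```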

# 2. Indexing

---

An inverted index is built that maps each token to the list of indexes of the titles in which it appears. A token that occurs more than once in the same title is recorded once per occurrence.

```python
data_dict = {}

for a in range(len(pub_list_stemmed)):
    for b in pub_list_stemmed[a].split():
        if b not in data_dict:
            data_dict[b] = [a]
        else:
            data_dict[b].append(a)
```
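As a rough illustration of how such an index can be queried, here is a minimal sketch on a toy corpus. The stemmed titles are made up for the example and are not taken from the dataset, and `lookup` is a hypothetical helper.

```python
# Toy stemmed titles; made up for illustration, not taken from the dataset.
docs = ["python crash cours", "python playground", "rust program languag"]

# Build the inverted index: token -> list of document indexes.
index = {}
for doc_id, doc in enumerate(docs):
    for token in doc.split():
        index.setdefault(token, []).append(doc_id)


def lookup(token):
    """Return the indexes of documents containing the token (empty if absent)."""
    return index.get(token, [])


print(lookup("python"))
```

Looking up a token returns the candidate documents directly, so only those documents need to be scored against the query.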

### Inverted Index 

---



# 3. Search Engine

---

This Search Engine uses the TF-IDF algorithm.

[**TF-IDF**](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) stands for **"Term Frequency–Inverse Document Frequency"**. It is a technique for weighting each word by how important it is to a document relative to the whole corpus: the weight rises with the word's frequency in the document and falls with the number of documents that contain it.
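The weighting can be sketched by hand with the textbook definition, where TF is the term's share of the document and IDF is the logarithm of the corpus size divided by the number of documents containing the term. This is an assumption made for illustration: scikit-learn's `TfidfVectorizer` uses a smoothed IDF and normalizes each vector by default, so its values differ. The toy documents and function names are hypothetical.

```python
import math

# Toy tokenized documents; made up for illustration.
docs = [
    ["python", "crash", "course"],
    ["python", "playground"],
    ["rust", "programming", "language"],
]


def tf(term, doc):
    """Term frequency: share of the document's tokens equal to the term."""
    return doc.count(term) / len(doc)


def idf(term, docs):
    """Inverse document frequency; assumes the term occurs in at least one document."""
    df = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / df)


def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)


# "rust" appears in one document, "python" in two, so "rust" is weighted higher.
print(tf_idf("rust", docs[2], docs), tf_idf("python", docs[0], docs))
```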

## 3.1 Calculating ranking using [Cosine Similarity](https://en.wikipedia.org/wiki/Cosine_similarity)

---

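Cosine similarity is a widely used metric for measuring the similarity between two documents represented as vectors. It is the cosine of the angle between the two vectors, so for TF-IDF vectors it ranges from 0 (no terms in common) to 1 (same direction).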
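A minimal sketch of the computation on plain Python lists follows. The `cosine` function is a hypothetical helper written for illustration; the project itself relies on scikit-learn's `cosine_similarity`.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # Convention chosen for this sketch: an all-zero vector matches nothing.
        return 0.0
    return dot / (norm_u * norm_v)


print(cosine([1, 0], [0, 1]))  # vectors with no shared terms
print(cosine([1, 2], [2, 4]))  # vectors pointing in the same direction
```

Because the result depends only on direction, a long document and a short one with the same term proportions score as identical.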



## 3.2 Generating TF-IDF using TfidfVectorizer

---

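```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

tfidf = TfidfVectorizer()  # assumed: the vectorizer setup is not shown in the original

# Fit on the candidate documents, then score each one against the stemmed query.
# stem_word_file is expected to be a list of strings.
temp_file = tfidf.fit_transform(temp_file)
cosine_output = cosine_similarity(temp_file, tfidf.transform(stem_word_file))
```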
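Put together, the ranking step might look like the sketch below. It is not the repository's `search_data` function, whose body is not shown here. `search_titles` and the toy corpus are hypothetical, and stemming is omitted to keep the example short.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus; made up for illustration, not the scraped dataset.
titles = ["python crash course", "python playground", "rust programming language"]


def search_titles(query, titles):
    """Rank titles by cosine similarity to the query, dropping non-matches."""
    tfidf = TfidfVectorizer()
    doc_vectors = tfidf.fit_transform(titles)   # one TF-IDF row per title
    query_vector = tfidf.transform([query])     # same vocabulary as the titles
    scores = cosine_similarity(doc_vectors, query_vector).ravel()
    ranked = sorted(zip(scores, titles), reverse=True)
    return [(title, float(score)) for score, title in ranked if score > 0]


results = search_titles("python", titles)
for title, score in results:
    print(f"{score:.3f}  {title}")
```

In this toy corpus the shorter title ranks first, because after normalization "python" makes up a larger share of its vector.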

# Testing the function

---

```python
search_data('python')
```



**Result of similar documents for word "Python".** 

# Conclusion

---

In its current state, the search engine has limited capability: it only finds documents that share terms with the query.

Using a vector encoder model would enable semantic search, returning results that are similar in meaning, whereas the TF-IDF model only matches terms and does not capture what words mean.
