Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/ashad001/ir-indexing

CS4051 - Information Retrieval Course Assignment
https://github.com/ashad001/ir-indexing

ai boolean-model indexing information-retrieval inverted-index kmeans knn python react streamlit vectorspacemodel

Last synced: 22 days ago
JSON representation

CS4051 - Information Retrieval Course Assignment

Host: GitHub
URL: https://github.com/ashad001/ir-indexing
Owner: Ashad001
Created: 2024-02-19T08:34:28.000Z (12 months ago)
Default Branch: main
Last Pushed: 2024-05-07T07:37:44.000Z (9 months ago)
Last Synced: 2024-11-11T05:07:23.407Z (3 months ago)
Topics: ai, boolean-model, indexing, information-retrieval, inverted-index, kmeans, knn, python, react, streamlit, vectorspacemodel
Language: Python
Homepage:
Size: 35.6 MB
Stars: 1
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

Awesome Lists containing this project

README

# Information Retreival Using Vector Space Model

This Flask app facilitates Information Retrieval using various indexing techniques and incorporates a React frontend for enhanced user interaction. Additionally, documents are ranked using the Vector Space Model based on TF-IDF (Term Frequency-Inverse Document Frequency). The project structure is outlined below, providing an overview of the organization and key components.

## Project Structure

```plaintext
indexer/
│
├── api/
│ ├── data/
│ │ ├── ResearchPapers/
│ │ │ ├── (Research papers files)
│ │ └── Stopword-List.txt
│ ├── docs/
│ ├── logs/
│ ├── src/
│ │ ├── indexer/
│ │ ├── indexes/
│ │ ├── models/
│ │ ├── processing/
│ │ ├── vocab/
│ │ ├── __pycache__/
│ │ ├── __init__.py
│ │ ├── logger.py
│ │ ├── retreival.py
│ │ └── utils.py
│ ├── .flaskenv
│ ├── app.py
│ ├── README.md
│ └── requirements.txt
│
├── node_modules/
├── public/
├── src/
│ ├── App.css
│ ├── App.js
│ ├── App.test.js
│ ├── index.css
│ ├── index.js
│ ├── logo.svg
│ ├── reportWebVitals.js
│ └── setupTests.js
│
├── .gitignore
├── package-lock.json
├── package.json
└── README.md

```

## Project Components

- **`data/`:** Contains research papers in the `ResearchPapers/` directory and a file `Stopword-List.txt` with common stop words.
- Simply place new files in this folder, and the app will automatically index them.
- Don't Remove `Stopword-List.txt` as it is used for stop word removal, though you can update the .txt file manually.

- **`flows/`:** Contains diagrams and drawings illustrating data flows and UI design.

- **`src/`:** Contains the source code for the app and various modules for indexing and retrieval.

- **`static/`:** Includes JavaScript (`script.js`) and CSS (`styles.css`) files for static content.

- **`templates/`:** Contains HTML template for rendering pages.

- **`tests/`:** Contains unit tests with corresponding test sets for various functionalities.

- **`tests/test_sets:`** Add your test sets in the files
- `golden_boolean_queries.txt`
- `golden_proximity_queries.txt`
- Enter Queries of the form:
- Example Query: TOUR_QUERY
- Result-Set: EXPECTED_RESULTS

- **`app.py`:** The main Flask application file.

## Flow & Design
### Data Flow
![Data Flow](flows/dataflow.png)

### UI design
![UI](flows/ui.png)

## Functionality

The app offers efficient retrieval capabilities, emphasizing performance and user experience.

### Index Generation and Metadata Logging

- Index generation occurs at the beginning and is only performed once, saving indexes to files.
- Metadata, including information about file structure and indexes, is logged for future reference.
- If indexes are requested again, the app checks for changes in data and regenerates only the necessary indexes.

### Performance Logging

- Essential performance metrics are logged, providing insights into processing times for index formation, search operations, and more.
- This information helps in monitoring and optimizing the efficiency of the retrieval system.

### Query Processing

- The app prompts users to enter queries, whether boolean or proximity-based.
- The algorithm determines the query type and performs the search accordingly.
- Suggestions for words are provided to users, enhancing the query input experience.
- Trie-based searching is employed for efficient and fast word suggestions.

### Search Results Presentation

- Documents are ranked using the Vector Space Model based on TF-IDF scores.
- If documents match the user's query, the app presents the corresponding document IDs along with their relevance scores.
- In the absence of matching documents, the app attempts to correct the query using Levenshtein distance on a word-by-word basis.
- The corrected query is presented to the user, and if the original and corrected queries are identical, the user is informed that no documents match the query.
- Each document in the search result is accompanied by a static summary. Hovering over the document displays its rank/score.

### Logging User Interaction

- The app logs important information about user queries, errors, and search results.
- This logging allows for a comprehensive review of user interactions, aiding in system analysis and improvement.

The combined features ensure a seamless and efficient experience for users interacting with the IR-Indexing app, promoting effective information retrieval and user-friendly query processing.

## Running the Project

To run the project, follow these steps:

1. Set up a Python environment and install dependencies:

```bash
cd api
python -m venv venv
venv\Scripts\activate
```

```bash
pip install -r requirements.txt
```

2. Run the Flask app:

```bash
cd .. (to go back to the root directory)
npm install
yarn start-api
```

3. Open a web browser and navigate to `http://127.0.0.1:5000/` to interact with the app.

For the React frontend:

1. Start the React development server:

```bash
yarn start
```

2. The React app will be running on `http://localhost:3000/` by default.

## Acknowledgements
- The Porter Stemmer implementation is based on the original algorithm by Martin Porter.
- Source: [https://vijinimallawaarachchi.com/2017/05/09/porter-stemming-algorithm/](https://vijinimallawaarachchi.com/2017/05/09/porter-stemming-algorithm/)
- GitHub Repository: [https://github.com/jedijulia/porter-stemmer/blob/master/stemmer.py](https://github.com/jedijulia/porter-stemmer/)

- The Levenshtein distance algorithm is based on the original algorithm by Vladimir Levenshtein.
- Source: [https://en.wikipedia.org/wiki/Levenshtein_distance](https://en.wikipedia.org/wiki/Levenshtein_distance)