An open API service indexing awesome lists of open source software.

https://github.com/preethi2805/rag_system_for_chatting_with_documents

This project implements a financial research assistant tool that extracts insights from a PDF Report by generating augmented queries and performing semantic document search.
https://github.com/preethi2805/rag_system_for_chatting_with_documents

langchain python query-expansion retrieval-augmented-generation

Last synced: about 2 months ago
JSON representation

This project implements a financial research assistant tool that extracts insights from a PDF Report by generating augmented queries and performing semantic document search.

Awesome Lists containing this project

README

          

# Financial Research Assistant with Augmented Query Expansion

## Project Description

This project builds a financial research assistant capable of answering complex finance-related questions by extracting insights from financial reports. The system leverages Natural Language Processing (NLP), machine learning models, and a combination of text preprocessing, embeddings, and query expansion to deliver precise and contextually relevant answers to user queries.

The project is designed to handle reports and allows for detailed analysis through:

- Text extraction from PDF reports.
- Splitting long texts into chunks for easier processing.
- Converting text into embeddings (vector representations) using pre-trained models.
- ChromaDB is used to store and query embeddings.
- Generating augmented queries to improve search results and answer relevance.
- Visualizing document embeddings in a lower-dimensional space using UMAP.

## Features

- **Text Extraction & Preprocessing**: Extracts and cleans the text from PDF documents, filters out empty text, and splits large texts into manageable chunks.

- **Embedding & Vector Search**: Embeds the text into vector space using Sentence-Transformers, then stores and queries this data using ChromaDB for semantic search.

- **Augmented Query Expansion**: Enhances user queries by generating related questions to refine results and improve the search process.

- **Visualizing Document Embeddings**: Projects high-dimensional embeddings into a 2D space using UMAP, enabling a clear visualization of document clusters and query relevance.

## Requirements

To run the project, you'll need the following dependencies:

- `pypdf` - For extracting text from PDF files
- `chromadb` - For storing and querying document embeddings
- `groq` - To interact with Groq's API for generating augmented queries
- `langchain` - For text splitting and processing
- `matplotlib` - For visualizing the results in 2D space
- `numpy` - For numerical operations
- `umap-learn` - For dimensionality reduction

## Project Output

- **Query Results**: The system will output answers based on the retrieved documents from the ChromaDB query.
- **Augmented Queries**: A list of related questions that are generated to help refine the user's original query.

Here are images that show the actual **retrieved results** and **augmented queries** used in the project.

#### Retrieved Results
![Retrieved Results](/hypothetical_answer.png)

#### Augmented Queries
![Augmented Queries](/Augumented_queries.png)

This image shows the augmented queries that were generated by rephrasing the original query. The goal of these augmented queries is to explore different ways of asking the same or similar questions, expanding the range of possible relevant results.

- **Visualizations**: The script will generate a plot that visualizes the embeddings of the documents, original query, augmented queries, and retrieved answers in a 2D space using UMAP.

#### Scatter Plot Analysis

The scatter plot below visualizes the document embeddings in a 2D space after applying UMAP dimensionality reduction. Each point represents an embedding of either a document, a query, or an augmented query. The plot allows us to assess the relationships between the documents in the collection and how well the queries (both original and augmented) align with the relevant documents.

![alt text](/Figure_1.png)

- **Blue dots** represent the embeddings of the document chunks from the PDF, visualizing their distribution in the reduced space.
- **Red "X"** represents the embedding of the original query, showing where it lies relative to the documents.
- **Orange "X"** represents the augmented queries, which are alternative phrasings of the original query.
- **Green circle (hollow)** represents the embeddings of the most relevant documents retrieved for the query.

#### Interpretation:
- The closer the red "X" is to the blue dots, the more relevant the documents are to the original query.
- The orange "X" markers show the results of query expansion, and ideally, they should cluster near the original query.
- The green circles indicate the retrieved documents that are considered most relevant based on semantic similarity.

## Future Improvements

- Integration of more sophisticated query generation and retrieval mechanisms.
- Deployment of the model as a web app for real-time use.

---