Venture Capital firms Analyzer with LLM
https://github.com/ideavision/vc_assistant
- Host: GitHub
- URL: https://github.com/ideavision/vc_assistant
- Owner: ideavision
- Created: 2024-05-02T13:22:58.000Z (almost 2 years ago)
- Default Branch: master
- Last Pushed: 2024-05-02T13:39:30.000Z (almost 2 years ago)
- Last Synced: 2025-07-06T00:07:04.044Z (7 months ago)
- Topics: ai, api, fastapi, generative-ai, gpt, langchain, llm, nlp, python, qdrant, rag, scraper-api, vector-database, venture-capital
- Language: Python
- Homepage:
- Size: 1.87 MB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# 🤖 VC_assistant (Venture Capital Assistant)
# Overview
VC_assistant is an adaptive intelligence chatbot designed to facilitate the creation of enterprise applications with advanced conversational capabilities for venture capital research: it scrapes VC websites, embeds and stores their content in a vector database, extracts key information as JSON, and finds similar firms. It leverages a tool-using agentic architecture and retrieval-augmented generation (RAG) built on state-of-the-art open-source frameworks such as LangChain and FastAPI. This document outlines the architecture, components, and use cases of VC_assistant, providing a comprehensive understanding of its functionality and deployment.
# Main Features
Given a VC website URL as input by the user, the Generative AI assistant is able to:

- Scrape the home page of the website and store the content in a vector database of your choice,
- Extract the following information and show it to the user as a JSON object: VC name, contacts, industries they invest in, and investment rounds they participate in or lead (see the example below),
- Compare the VC website content to the other VC website contents in the database.
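For illustration, the extracted JSON object might look like the following (field names and values are hypothetical, not the exact schema the service returns):
```json
{
  "vc_name": "Example Ventures",
  "contacts": ["info@example-ventures.com", "https://www.linkedin.com/company/example-ventures"],
  "industries": ["fintech", "healthcare", "enterprise SaaS"],
  "investment_rounds": ["pre-seed", "seed", "series A (lead)"]
}
```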

# Prerequisites
To use this project, you will need:
- Docker and Docker Compose installed
- Python 3.7+
- An OpenAI API key
# Setup
To set up the project:
1. Clone this repository to your local machine.
2. Rename `key.env.example` to `key.env` and add your OpenAI API key.
3. Review `config.yml` and choose `openai` or `local` for your `Embedding_Type`.
4. In `docker-compose.yml`, update the `volumes` path for `VC_ASSISTANT_QDRANT` to a local folder where you want persistent storage for the vector database.
5. Create the needed directories for persistent storage:
```bash
mkdir -p .data/qdrant/
```
6. Build the Docker images:
```bash
docker-compose build
```
7. Start the services:
```bash
docker-compose up -d
```
The services will now be running.
## **Deployment and Usage**
Once the Docker containers are up and running, you can start interacting with:
- The **interactive Swagger docs** at [http://localhost:8000/docs](http://localhost:8000/docs)
- The **Qdrant Web Interface** at [http://localhost:6333/dashboard](http://localhost:6333/dashboard)
### Build the default Collection
1. **Scrape Documents:**
- Go to the FastAPI server by navigating to the interactive Swagger docs at [http://localhost:8000/docs](http://localhost:8000/docs).
- Use the `scraper` endpoint to scrape content from a specified URL. You will need to provide the URL you want to scrape in the request body.
- The `scraper` endpoint will return the scraped content which will be processed in the next step.
2. **Create a Vector Index:**
- Use the `process-scrape2vector` endpoint to create a vector index from the scraped content.
- In case the `default` collection does not exist, the `process-scrape2vector` endpoint will create one for you.
- This endpoint will process the scraped documents, create a vector index, and load it into Qdrant (a Python sketch of both calls follows below).
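For reference, here is one way to drive both endpoints from Python instead of the Swagger UI. The endpoint paths and payload field names here are assumptions based on the endpoint names; check the interactive docs at `/docs` for the real request schemas.
```python
import requests

BASE_URL = "http://localhost:8000"

# 1. Scrape a VC home page (payload shape assumed).
scrape_resp = requests.post(f"{BASE_URL}/scraper", json={"url": "https://example-vc.com"})
scrape_resp.raise_for_status()
print(scrape_resp.json())

# 2. Index the scraped content into Qdrant; the `default` collection
#    is created automatically if it does not exist yet.
index_resp = requests.post(f"{BASE_URL}/process-scrape2vector")
index_resp.raise_for_status()
print(index_resp.json())
```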
## Architecture Overview
The VC_Assistant architecture consists of the following key components:
- FastAPI - High performance REST API framework. Handles HTTP requests and routes them to application logic.
- Qdrant - Vector database for storing document embeddings and enabling similarity search.
- AgentHandler - Orchestrates the initialization and execution of the conversational agent.
- Scraper - A tool that scrapes a web page and converts it to markdown.
- Loader - A tool that loads content from the scraped_data directory into a VectorStoreIndex.
- Tools - Custom tools that extend the capabilities of the agent.
## Infrastructure
Let's take a closer look at some of the key VC_assistant infrastructure components:
### FastAPI
FastAPI provides a robust web framework for handling the API routes and HTTP requests/responses.
Some key advantages:
- Built on modern Python standards like type hints and ASGI.
- Extremely fast - among the fastest Python frameworks available, with performance on par with NodeJS and Go in published benchmarks.
- Automatic interactive docs using OpenAPI standards (a minimal example follows below).
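To give a flavor of what this looks like in practice, here is a minimal route in the style of the project's API. This is illustrative only; the real endpoints live in the repository's application code.
```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class ScrapeRequest(BaseModel):
    url: str  # type hints drive request validation and the generated docs

@app.post("/scraper")
def scrape(request: ScrapeRequest) -> dict:
    # A real implementation would invoke the scraper; this stub just echoes.
    return {"message": "received", "url": request.url}
```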
### Qdrant
Qdrant is a vector database optimized for ultra-fast similarity search across large datasets. It is used in this project to store and index document embeddings, enabling the bot to quickly find relevant documents based on a search query or conversation context.
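A minimal similarity query against the local instance might look like the following sketch, using the official `qdrant-client` package. The query vector is a stand-in for a real embedding; its dimensionality depends on the embedding model chosen in `config.yml`.
```python
from qdrant_client import QdrantClient

client = QdrantClient(url="http://localhost:6333")

# Stand-in query vector; 1536 matches OpenAI's ada-002 embeddings, but
# the `local` embedding option may use a different dimensionality.
query_vector = [0.1] * 1536

hits = client.search(
    collection_name="default",  # the collection created during setup
    query_vector=query_vector,
    limit=5,
)
for hit in hits:
    print(hit.id, hit.score)
```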
### Document Scraping Section
The `scraper` module, located in `/app/src/scraper/scraper_main.py`, serves as a robust utility for extracting content from web pages and converting it into structured markdown format. This module is integral to enabling the framework to access and utilize information from a wide range of web sources. Below is a succinct overview of its core functionalities and workflow for developers aiming to integrate and leverage this module effectively.
#### Components:
- **WebScraper Class:**
- Inherits from the base Scraper class and implements the Singleton pattern to ensure a unique instance.
- Orchestrates the entire scraping process, from fetching to parsing, and saving the content.
- Leverages `ContentParser` to extract and convert meaningful data from HTML tags into markdown format.
- **ContentParser Class:**
- Designed to parse and convert meaningful content from supported HTML tags into markdown format.
- Supports a variety of HTML tags including paragraphs, headers, list items, links, inline code, and code blocks.
#### Workflow:
1. **URL Validation:**
- The provided URL undergoes validation to ensure its correctness and accessibility.
- If the URL is invalid, the process is terminated, and an error message is logged.
2. **Content Fetching:**
- Content from the validated URL is fetched using HTTP requests.
- Utilizes random user agents to mimic genuine user activity and avoid potential blocking by web servers.
- If the content fetching fails, the process is halted, and an error message is logged.
3. **Content Parsing:**
- The fetched content is parsed using BeautifulSoup, and the `ContentParser` class is employed to extract meaningful data.
- The parsed data includes the title, metadata, and the content in markdown format.
4. **File Saving:**
- The parsed content is saved to a file; the filename is generated using a hash of the URL.
- The file is stored in a pre-configured data directory.
- If the file saving fails, an error message is logged.
5. **Result Return:**
- Upon the successful completion of the scraping process, a success message and the filepath of the saved content are returned.
- If any step in the process fails, an appropriate error message is returned.
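Condensed into code, the workflow above looks roughly like this (a hypothetical sketch, not the repository's actual implementation):
```python
import hashlib
import random
from pathlib import Path

import requests
from bs4 import BeautifulSoup

# Hypothetical user-agent pool; the real module draws from its own list.
USER_AGENTS = ["Mozilla/5.0 (compatible; vc-assistant-demo/1.0)"]

def scrape_to_markdown(url: str, data_dir: str = "scraped_data") -> str:
    # Steps 1-2: URL validation is elided here; fetch with a random user agent.
    response = requests.get(
        url, headers={"User-Agent": random.choice(USER_AGENTS)}, timeout=10
    )
    response.raise_for_status()

    # Step 3: parse with BeautifulSoup; a real ContentParser would convert
    # paragraphs, headers, lists, links, and code blocks to markdown.
    soup = BeautifulSoup(response.text, "html.parser")
    title = soup.title.string if soup.title else url

    # Step 4: save under a hash of the URL in the configured data directory.
    Path(data_dir).mkdir(parents=True, exist_ok=True)
    filepath = Path(data_dir) / f"{hashlib.md5(url.encode()).hexdigest()}.md"
    filepath.write_text(f"# {title}\n", encoding="utf-8")

    # Step 5: return the location of the saved content.
    return str(filepath)
```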
#### Usage:
Developers can initiate the scraping process by invoking the `run_web_scraper(url)` function with the desired URL. This function initializes a `WebScraper` instance and triggers the scraping process, returning a dictionary containing the outcome of the scraping process, including messages indicating success or failure and the location where the scraped data has been saved.
#### Example:
```python
# Assumes the module is importable from the project root; it lives at
# /app/src/scraper/scraper_main.py, so the import path below is inferred.
from src.scraper.scraper_main import run_web_scraper

result = run_web_scraper("http://example.com")
if result and result.get("message") == "Scraping completed successfully":
    print(f"Scraping complete! Saved to {result['data']}")
else:
    print(result["message"])
```