https://github.com/finntegrate/migri-assistant
A chatbot with specific information from the Migri.fi website.
- Host: GitHub
- URL: https://github.com/finntegrate/migri-assistant
- Owner: finntegrate
- License: apache-2.0
- Created: 2025-04-15T05:34:43.000Z (10 months ago)
- Default Branch: main
- Last Pushed: 2025-04-23T02:51:55.000Z (10 months ago)
- Last Synced: 2025-04-23T16:16:58.888Z (10 months ago)
- Topics: agent, chatbot, finland, generative-ai, gradio, gradio-python-llm, immigration, ollama, retrieval-augmented-generation
- Language: Python
- Homepage:
- Size: 530 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
# Migri Assistant
## Overview
Migri Assistant is a tool for extracting, processing, and querying information from websites, with functionality tailored to the Migri.fi website. It provides an end-to-end RAG (Retrieval-Augmented Generation) pipeline: web crawling, content parsing, vectorization, and an interactive chatbot interface.
## Key demographics
- EU citizens
- Non-EU citizens
- Target Audience
  - Students
  - Workers
  - Families
  - Refugees
  - Asylum Seekers
## Needs
- Finding relevant information
- Conversation practice based on the topics they search (e.g. family reunification, work, studies)
## Features
- Crawls web pages to a configurable depth
- Saves raw HTML content with domain-based organization
- Parses HTML content into structured Markdown files
- Vectorizes parsed content into ChromaDB for semantic search
- Provides a Gradio-based RAG chatbot interface for querying content
- Integrates with Ollama for local LLM inference
- Clean separation between crawling, parsing, vectorization, and querying
- Domain restriction and crawl depth control
- Comprehensive test suite
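The crawl depth and domain restriction features listed above can be illustrated with a small breadth-first traversal. This is a minimal sketch, not the project's crawler: the `crawl` function, the `get_links` callable, and every URL in the toy link graph other than the start page are assumptions made for illustration, and no network requests are made.

```python
from collections import deque
from urllib.parse import urlparse


def crawl(start_url, get_links, max_depth=2):
    """Breadth-first crawl restricted to the start URL's domain.

    get_links is a hypothetical callable that returns the absolute URLs
    found on a page; a real crawler would fetch and parse the page here.
    """
    allowed_domain = urlparse(start_url).netloc
    visited = {start_url}
    order = []
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        if depth >= max_depth:
            continue  # do not follow links beyond the configured depth
        for link in get_links(url):
            if urlparse(link).netloc != allowed_domain:
                continue  # domain restriction: skip external links
            if link in visited:
                continue  # avoid crawling the same page twice
            visited.add(link)
            queue.append((link, depth + 1))
    return order


# Toy in-memory link graph standing in for real pages (hypothetical URLs).
LINKS = {
    "https://migri.fi/en/home": [
        "https://migri.fi/en/work",
        "https://example.com/external",
    ],
    "https://migri.fi/en/work": ["https://migri.fi/en/work/permits"],
    "https://migri.fi/en/work/permits": ["https://migri.fi/en/deeper"],
}

pages = crawl("https://migri.fi/en/home", lambda u: LINKS.get(u, []), max_depth=1)
print(pages)
```

With `max_depth=1` only the start page and its same-domain links are visited; raising the depth lets the traversal reach pages further from the start URL.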
## Installation and Setup
### Prerequisites
- Python 3.10 or higher
- [uv](https://github.com/astral-sh/uv) - Fast Python package installer and resolver
- [Ollama](https://ollama.ai/) - For local LLM inference (required for the chatbot)
### Setting up with uv
1. Clone the repository:
```bash
git clone https://github.com/Finntegrate/migri-assistant.git
cd migri-assistant
```
2. Create and activate a virtual environment with uv:
```bash
uv venv
source .venv/bin/activate # On Unix/macOS
# OR
.\.venv\Scripts\activate # On Windows
```
3. Install dependencies:
```bash
uv sync --dev
```
4. Ensure you have the required Ollama models:
```bash
ollama pull llama3.2
```
## Usage
### Running the Crawler, Parser, and Vectorizer
The tool follows a three-step process to crawl, parse, and vectorize content, after which you can launch the chatbot to query it:
1. **Crawl** a website to retrieve and save HTML content:
```bash
uv run -m migri_assistant.cli crawl https://migri.fi/en/home --depth 2 --output-dir crawled_content
```
2. **Parse** the HTML content into structured Markdown:
```bash
uv run -m migri_assistant.cli parse --input-dir crawled_content --output-dir parsed_content
```
3. **Vectorize** the parsed Markdown content into ChromaDB for semantic search:
```bash
uv run -m migri_assistant.cli vectorize --input-dir parsed_content --db-dir chroma_db --collection migri_docs
```
4. **Launch the RAG Chatbot** to interactively query the content:
```bash
uv run -m migri_assistant.cli gradio-app
```
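If you rerun the pipeline often, the crawl, parse, and vectorize steps above could be chained from a small script. The sketch below only assembles the same command lines shown in this section; the helper names `build_pipeline_commands` and `run_pipeline` are hypothetical and not part of the project.

```python
import subprocess

# Base invocation used throughout this README.
CLI = ["uv", "run", "-m", "migri_assistant.cli"]


def build_pipeline_commands(url, depth=2, html_dir="crawled_content",
                            md_dir="parsed_content", db_dir="chroma_db",
                            collection="migri_docs"):
    """Return the crawl, parse, and vectorize commands as argument lists."""
    return [
        CLI + ["crawl", url, "--depth", str(depth), "--output-dir", html_dir],
        CLI + ["parse", "--input-dir", html_dir, "--output-dir", md_dir],
        CLI + ["vectorize", "--input-dir", md_dir, "--db-dir", db_dir,
               "--collection", collection],
    ]


def run_pipeline(commands):
    """Run each step in order, stopping at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)


commands = build_pipeline_commands("https://migri.fi/en/home")
for cmd in commands:
    print(" ".join(cmd))
# Call run_pipeline(commands) to actually execute the steps.
```

Building the commands separately from running them makes it easy to inspect what will be executed before starting a crawl.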
### RAG Chatbot Options
The RAG chatbot allows you to query information from your vectorized content using a local LLM through Ollama. The chatbot provides several configuration options:
```bash
# Quick start - launch with development server
uv run -m migri_assistant.cli dev
# Long form - launch with default settings
uv run -m migri_assistant.cli gradio-app
# Use a specific Ollama model
uv run -m migri_assistant.cli gradio-app --model-name llama3.2:latest
# Specify a different ChromaDB collection
uv run -m migri_assistant.cli gradio-app --collection-name my_collection
# Create a shareable link for the app
uv run -m migri_assistant.cli gradio-app --share
```
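Conceptually, a RAG chatbot retrieves the stored chunks most similar to the question and passes them to the LLM as context. The sketch below illustrates that flow using only the standard library. It is not the project's implementation: a bag-of-words cosine similarity stands in for real embeddings and ChromaDB, the prompt wording is an assumption, the chunks are placeholder text, and no call to Ollama is made.

```python
import math
from collections import Counter


def bow(text):
    """Bag-of-words vector; a stand-in for a real embedding model."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve(query, chunks, k=2):
    """Return the k chunks most similar to the query."""
    q = bow(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, bow(c)), reverse=True)
    return ranked[:k]


def build_prompt(query, context_chunks):
    """Assemble a prompt that grounds the answer in the retrieved context."""
    context = "\n\n".join(context_chunks)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )


# Placeholder chunks, not real Migri.fi content.
chunks = [
    "placeholder paragraph about residence permits for studies",
    "placeholder paragraph about family reunification",
    "placeholder paragraph about work permits",
]
question = "family reunification"
top = retrieve(question, chunks, k=1)
prompt = build_prompt(question, top)
print(prompt)
```

In the actual application the retrieval step would query the vectorized collection and the prompt would be sent to the selected Ollama model.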
### Parameters and Options
For detailed information about available parameters and options for any command:
```bash
uv run -m migri_assistant.cli --help
```
Available commands:
- `crawl`: Crawl websites and save HTML content
- `parse`: Parse HTML files into structured Markdown
- `vectorize`: Vectorize parsed Markdown into ChromaDB
- `gradio-app`: Launch the Gradio RAG chatbot interface
- `dev`: Launch the chatbot with the development server (quick start)
- `info`: Show information about available commands
## Development
### Code Quality
We use [Ruff](https://docs.astral.sh/ruff/) for linting and formatting. To run the linter:
```bash
uv run ruff check .
```
To automatically fix issues:
```bash
uv run ruff check . --fix
```
To check formatting without fixing:
```bash
uv run ruff format . --check
```
### Running Tests
```bash
uv run pytest
```
To run tests with code coverage reports:
```bash
# Generate coverage report in the terminal
uv run pytest --cov=migri_assistant
# Generate HTML coverage report
uv run pytest --cov=migri_assistant --cov-report=html
# Get coverage for specific modules
uv run pytest --cov=migri_assistant.utils tests/utils/
```
The HTML coverage report will be generated in the `htmlcov` directory. Open `htmlcov/index.html` in your browser to view it.
## Project Structure
The project has been designed with a clear separation of concerns:
- `crawler/`: Module responsible for crawling websites and saving HTML content
- `parsers/`: Module responsible for parsing HTML content into structured formats
- `vectorstore/`: Module responsible for vectorizing content and storing in ChromaDB
- `gradio_app.py`: Gradio interface for the RAG chatbot
- `utils/`: Utility modules for embedding generation, markdown processing, etc.
- `tests/`: Test suite for all modules
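To give a feel for what the parsing stage does, here is a minimal HTML-to-Markdown sketch built on the standard library's `html.parser`. It is an illustration only and does not reflect the code in `parsers/`: the `MarkdownExtractor` class and `html_to_markdown` function are hypothetical, and only headings and paragraphs are handled.

```python
from html.parser import HTMLParser


class MarkdownExtractor(HTMLParser):
    """Collects headings and paragraphs as Markdown lines (toy converter)."""

    HEADINGS = {"h1": "#", "h2": "##", "h3": "###"}

    def __init__(self):
        super().__init__()
        self.lines = []
        self._prefix = None  # Markdown prefix of the open element, None outside one
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            self._prefix = self.HEADINGS[tag] + " "
            self._buffer = []
        elif tag == "p":
            self._prefix = ""
            self._buffer = []

    def handle_data(self, data):
        # Text outside headings and paragraphs (e.g. scripts) is ignored.
        if self._prefix is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if (tag in self.HEADINGS or tag == "p") and self._prefix is not None:
            text = " ".join("".join(self._buffer).split())  # normalize whitespace
            if text:
                self.lines.append(self._prefix + text)
            self._prefix = None


def html_to_markdown(html):
    """Convert a small HTML document to Markdown text."""
    parser = MarkdownExtractor()
    parser.feed(html)
    parser.close()
    return "\n\n".join(parser.lines)


sample = (
    "<html><body><h1>Permits</h1>"
    "<p>Some <b>placeholder</b> text.</p>"
    "<script>x</script></body></html>"
)
markdown = html_to_markdown(sample)
print(markdown)
```

A real parser would also need to handle lists, links, tables, and site-specific page layout.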
## License
This project is licensed under the Apache 2.0 License. See the LICENSE file for more details.