https://github.com/finntegrate/tapio
A chatbot with specific information for Finnish immigrants.
- Host: GitHub
- URL: https://github.com/finntegrate/tapio
- Owner: finntegrate
- License: eupl-1.2
- Created: 2025-04-15T05:34:43.000Z (10 months ago)
- Default Branch: main
- Last Pushed: 2025-05-31T09:56:26.000Z (9 months ago)
- Last Synced: 2025-06-02T12:40:51.857Z (9 months ago)
- Topics: agent, chatbot, finland, generative-ai, gradio, gradio-python-llm, immigration, ollama, retrieval-augmented-generation
- Language: Python
- Homepage: https://finntegrate.org
- Size: 717 KB
- Stars: 0
- Watchers: 0
- Forks: 2
- Open Issues: 26
- Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
# Tapio
Tapio is a RAG (Retrieval-Augmented Generation) tool for extracting, processing, and querying information from websites such as Migri.fi (the Finnish Immigration Service). It covers the complete workflow: web crawling, content parsing, vectorization, and an interactive chatbot interface.
## Features
- **Multi-site support** - Configurable site-specific crawling and parsing
- **End-to-end pipeline** - Crawl → Parse → Vectorize → Query workflow
- **Local LLM integration** - Uses Ollama for private, local inference
- **Semantic search** - ChromaDB vector database for relevant content retrieval
- **Interactive chatbot** - Web interface for natural language queries
- **Flexible crawling** - Configurable depth and domain restrictions
- **Comprehensive testing** - Full test suite for reliability
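To illustrate the semantic-search idea behind the pipeline, here is a minimal, stdlib-only sketch of embedding retrieval. In Tapio this step is handled by ChromaDB with real embedding models; the toy two-dimensional vectors and the `retrieve` helper below are purely illustrative, not part of the codebase:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve(query_embedding: list[float], documents: list[dict], top_k: int = 1) -> list[dict]:
    """Rank documents by similarity to the query and return the top_k."""
    ranked = sorted(
        documents,
        key=lambda doc: cosine_similarity(query_embedding, doc["embedding"]),
        reverse=True,
    )
    return ranked[:top_k]

# Toy 2-dimensional "embeddings"; real embedding models use hundreds of dimensions.
docs = [
    {"text": "Work permit requirements", "embedding": [0.9, 0.1]},
    {"text": "Student residence permits", "embedding": [0.1, 0.9]},
]
# A query vector near [0.9, 0.1] retrieves the work-permit document.
best = retrieve([0.8, 0.2], docs)[0]
```

The retrieved passages are then passed to the local LLM as context for answering the user's question.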
## Target Use Cases
**Primary Users:** EU and non-EU citizens navigating Finnish immigration processes
- Students seeking education information
- Workers exploring employment options
- Families pursuing reunification
- Refugees and asylum seekers needing guidance
**Core Needs:**
- Finding relevant, accurate information quickly
- Practicing conversations on specific topics (family reunification, work permits, etc.)
## Installation and Setup
### Prerequisites
- Python 3.10 or higher
- [uv](https://github.com/astral-sh/uv) - Fast Python package installer
- [Ollama](https://ollama.ai/) - For local LLM inference
### Quick Start
1. Clone and set up:
```bash
git clone https://github.com/Finntegrate/tapio.git
cd tapio
uv sync
```
2. Install required Ollama model:
```bash
ollama pull llama3.2
```
## Usage
### CLI Overview
Tapio provides a four-step workflow:
1. **crawl** - Collect HTML content from websites
2. **parse** - Convert HTML to structured Markdown
3. **vectorize** - Create vector embeddings for semantic search
4. **tapio-app** - Launch the interactive chatbot interface
Use `uv run -m tapio.cli --help` to see all commands, or `uv run -m tapio.cli <command> --help` for command-specific options.
### Quick Example
Complete workflow for the Migri website:
```bash
# 1. Crawl content (uses site configuration)
uv run -m tapio.cli crawl migri --depth 2
# 2. Parse HTML to Markdown
uv run -m tapio.cli parse migri
# 3. Create vector embeddings
uv run -m tapio.cli vectorize
# 4. Launch chatbot interface
uv run -m tapio.cli tapio-app
```
### Available Sites
To list configured sites:
```bash
uv run -m tapio.cli list-sites
```
To view detailed site configurations:
```bash
uv run -m tapio.cli list-sites --verbose
```
## Site Configurations
Site configurations define how to crawl and parse specific websites. They're stored in `tapio/config/site_configs.yaml` and are used by both the `crawl` and `parse` commands.
### Configuration Structure
```yaml
sites:
migri:
base_url: "https://migri.fi" # Used for crawling and converting relative links
description: "Finnish Immigration Service website"
crawler_config: # Crawling behavior
delay_between_requests: 1.0 # Seconds between requests
max_concurrent: 3 # Concurrent request limit
parser_config: # Parser-specific configuration
title_selector: "//title" # XPath for page titles
content_selectors: # Priority-ordered content extraction
- '//div[@id="main-content"]'
- "//main"
- "//article"
- '//div[@class="content"]'
fallback_to_body: true # Use if selectors fail
markdown_config: # HTML-to-Markdown options
ignore_links: false
body_width: 0 # No text wrapping
protect_links: true
unicode_snob: true
ignore_images: false
ignore_tables: false
```
### Required vs Optional Fields
**Required:**
- `base_url` - Base URL for the site (used for crawling and link resolution)
**Optional (with defaults):**
- `description` - Human-readable description
- `parser_config` - Parser-specific settings (uses defaults if omitted)
- `title_selector` - Page title XPath (default: "//title")
- `content_selectors` - XPath selectors for content extraction (default: ["//main", "//article", "//body"])
- `fallback_to_body` - Use full body content if selectors fail (default: true)
- `markdown_config` - HTML conversion settings (uses defaults if omitted)
- `crawler_config` - Crawling behavior settings (uses defaults if omitted)
- `delay_between_requests` - Delay between requests in seconds (default: 1.0)
- `max_concurrent` - Maximum concurrent requests (default: 5)
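The optional fields above can be pictured as site-specific overrides layered on top of the documented defaults. This is a sketch of that merge; `resolve_crawler_config` is a hypothetical helper, not part of Tapio's API:

```python
# Documented defaults for crawler_config.
DEFAULT_CRAWLER_CONFIG = {"delay_between_requests": 1.0, "max_concurrent": 5}

def resolve_crawler_config(site: dict) -> dict:
    """Overlay a site's crawler_config (if any) on the defaults."""
    return {**DEFAULT_CRAWLER_CONFIG, **site.get("crawler_config", {})}

migri = {"base_url": "https://migri.fi", "crawler_config": {"max_concurrent": 3}}
cfg = resolve_crawler_config(migri)  # delay stays 1.0; max_concurrent becomes 3
```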
### Adding New Sites
1. Analyze the target website's structure
2. Identify XPath selectors for content extraction
3. Add configuration to `site_configs.yaml`:
```yaml
sites:
my_site:
base_url: "https://example.com"
description: "Example site configuration"
parser_config:
content_selectors:
- '//div[@class="main-content"]'
```
4. Use with commands:
```bash
uv run -m tapio.cli crawl my_site
uv run -m tapio.cli parse my_site
uv run -m tapio.cli vectorize
uv run -m tapio.cli tapio-app
```
## Configuration
Tapio uses centralized configuration in `tapio/config/settings.py`:
```python
DEFAULT_DIRS = {
"CRAWLED_DIR": "content/crawled", # HTML storage
"PARSED_DIR": "content/parsed", # Markdown storage
"CHROMA_DIR": "chroma_db", # Vector database
}
DEFAULT_CHROMA_COLLECTION = "tapio" # ChromaDB collection name
```
Site-specific configurations are in `tapio/config/site_configs.yaml` and automatically handle content extraction and directory organization based on the site's domain.
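Assuming per-site subdirectories under these defaults (an assumption based on the description above; the `site_dir` helper is hypothetical), directory resolution might look like:

```python
from pathlib import Path

DEFAULT_DIRS = {
    "CRAWLED_DIR": "content/crawled",  # HTML storage
    "PARSED_DIR": "content/parsed",    # Markdown storage
    "CHROMA_DIR": "chroma_db",         # Vector database
}

def site_dir(kind: str, site_name: str) -> Path:
    """Per-site subdirectory under one of the default directories."""
    return Path(DEFAULT_DIRS[kind]) / site_name

crawled_migri = site_dir("CRAWLED_DIR", "migri")  # content/crawled/migri
```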
## Contributing
See [CONTRIBUTING.md](CONTRIBUTING.md) for development guidelines, code style requirements, and how to submit pull requests.
## License
Licensed under the European Union Public License version 1.2. See [LICENSE](LICENSE) for details.
## Contributors ✨
Thanks goes to these wonderful people ([emoji key](https://allcontributors.org/docs/en/emoji-key)):
- **Brylie Christopher Oxley**: 🚇 ⚠️ 📖 🐛 💼 🖋 🤔 🚧 🧑🏫 📆 📣 🔬 👀 💻
- **AkiKurvinen**: 🔣 💻
- **ResendeTech**: 💻
This project follows the [all-contributors](https://github.com/all-contributors/all-contributors) specification. Contributions of any kind welcome!