{"id":26086009,"url":"https://github.com/bigsk1/supa-crawl-chat","last_synced_at":"2026-05-15T23:12:44.244Z","repository":{"id":281418571,"uuid":"945221252","full_name":"bigsk1/supa-crawl-chat","owner":"bigsk1","description":"Integrates Supabase with Crawl4AI and AI Chat to create a powerful web crawling and semantic search solution. Streamlit supabase data visualization. Run all in Docker. API and more! ","archived":false,"fork":false,"pushed_at":"2026-05-09T01:29:57.000Z","size":2008,"stargazers_count":28,"open_issues_count":1,"forks_count":5,"subscribers_count":4,"default_branch":"main","last_synced_at":"2026-05-09T03:34:23.713Z","etag":null,"topics":["crawl4ai","crawler","docker","embeddings","fastapi","gpt-4o","openai-api","pgvector","postgresql","scraping","streamlit","supabase"],"latest_commit_sha":null,"homepage":"https://github.com/bigsk1/supa-crawl-chat/wiki/Supa-%E2%80%90-Crawl-%E2%80%90-Chat-Wiki","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/bigsk1.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":".github/FUNDING.yml","license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":"security_utils.py","support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null},"funding":{"github":"bigsk1"}},"created_at":"2025-03-08T23:39:21.000Z","updated_at":"2026-05-09T01:29:54.000Z","dependencies_parsed_at":null,"dependency_job_id":"949c6f4e-616f-4183-b88a-f8b654f38ae8","html_url":"https://github.com/bigsk1/supa-crawl-chat","commit_stats":null,"previous_names":["bigsk1/supa-crawl-chat"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:gi
thub/bigsk1/supa-crawl-chat","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bigsk1%2Fsupa-crawl-chat","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bigsk1%2Fsupa-crawl-chat/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bigsk1%2Fsupa-crawl-chat/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bigsk1%2Fsupa-crawl-chat/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/bigsk1","download_url":"https://codeload.github.com/bigsk1/supa-crawl-chat/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bigsk1%2Fsupa-crawl-chat/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33082979,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-15T20:25:35.270Z","status":"ssl_error","status_checked_at":"2026-05-15T20:25:34.732Z","response_time":103,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while 
reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["crawl4ai","crawler","docker","embeddings","fastapi","gpt-4o","openai-api","pgvector","postgresql","scraping","streamlit","supabase"],"created_at":"2025-03-09T06:33:23.884Z","updated_at":"2026-05-15T23:12:44.231Z","avatar_url":"https://github.com/bigsk1.png","language":"Python","funding_links":["https://github.com/sponsors/bigsk1"],"categories":[],"sub_categories":[],"readme":"[![Python application](https://github.com/bigsk1/supa-crawl-chat/actions/workflows/python-app.yml/badge.svg)](https://github.com/bigsk1/supa-crawl-chat/actions/workflows/python-app.yml)\n![Docker support](https://img.shields.io/badge/docker-supported-blue)\n[![License](https://img.shields.io/github/license/bigsk1/supa-crawl-chat)](https://github.com/bigsk1/supa-crawl-chat/blob/main/LICENSE)\n\n# 🚀 Supa-Crawl-Chat\n\nIntroducing Supa-Crawl-Chat: A Comprehensive Web Crawling, Semantic Search, and AI-Driven Chat Solution with Supabase \u0026 Crawl4AI.\n\nSeamlessly crawl websites, transform content into vector embeddings, and enable advanced semantic search. 
Supa-Crawl-Chat utilizes Supabase for reliable data storage and incorporates AI-powered chat with long-term memory features.\n\n![crawl](https://imagedelivery.net/WfhVb8dSNAAvdXUdMfBuPQ/cbbb5f3b-d089-49a1-704f-c2ebd6bcef00/public)\n\n\n## ✨ Key Features\n\n- 🕷️ **High-Performance Web Crawling**\n  - Harness the power of Crawl4AI to efficiently index websites and sitemaps with configurable depth and scope\n  - Advanced crawling algorithms adapt to different website structures and content types for optimal data extraction\n  - Seamless handling of JavaScript-rendered content and dynamic websites\n\n- 🔍 **Advanced Semantic Search Engine**\n  - Leverage cutting-edge vector similarity and OpenAI embeddings for context-aware search capabilities\n  - Achieve up to 95% more relevant search results compared to traditional keyword-based approaches\n  - Fine-tuned ranking algorithms that understand semantic relationships between concepts\n\n- 📝 **AI-Powered Content Intelligence**\n  - Transform raw web content into structured, actionable data using terminal or UI.\n  - Generate human-quality titles, summaries, and site descriptions with remarkable accuracy\n  - Automatic content categorization and entity extraction for enhanced data organization\n\n- 📊 **Interactive Data Visualization**\n  - Explore your data ecosystem through an intuitive Streamlit-based interface\n  - Real-time analytics and insights into your content repository\n  - Customizable dashboards for monitoring crawl performance and content metrics\n\n- 🐳 **Scalable Deployment Architecture**\n  - Deploy with confidence using our Docker configurations:\n    - **Lightweight**: App-only deployment for integration with existing infrastructure\n    - **Standard**: App + Crawl4AI for complete content processing capabilities\n    - **Full-Stack**: End-to-end solution with App + Crawl4AI + Supabase for maximum autonomy\n\n- 🌐 **Comprehensive API Ecosystem**\n  - RESTful API with interactive OpenAPI docs (`/docs`); 
integration guide: **[docs/API.md](docs/API.md)**\n  - Optional authentication: `SUPA_API_AUTH` / `SUPA_API_KEY`, legacy `SCC_API_KEYS` / `API_KEYS`, and optional WebUI password + JWT (`WEBUI_PASSWORD`)\n  - Crawl URL validation (SSRF mitigation) with optional private-network overrides for trusted deployments\n\n\n## Prerequisites\n\n- Python 3.10+\n- Node 18+\n- A running Crawl4AI instance (self-hosted or provided)\n- A Supabase instance (self-hosted or provided)\n- OpenAI API key for generating embeddings, content summaries and chat\n- Docker (optional)\n\n## Installation\n\n1. Clone this repository:\n   ```\n   git clone https://github.com/bigsk1/supa-crawl-chat.git\n   cd supa-crawl-chat\n   ```\n\n2. Install the required dependencies:\n   ```\n   pip install -r requirements.txt\n   ```\n\n3. Change directory to `frontend` and install dependencies:\n\n   ```bash\n   cd frontend\n   ```\n\n   ```bash\n   npm install\n   ```\n\n\n4. Create a `.env` file with your configuration:\n\u003cdetails\u003e\n\u003csummary\u003eClick to expand\u003c/summary\u003e\n\n```env\n# Crawl4AI Configuration\n# Run locally in Docker or as an external service - easily set up with Docker Compose\nCRAWL4AI_API_TOKEN=your_crawl4ai_api_token\n\n# Local Docker\n# CRAWL4AI_BASE_URL=http://crawl4ai:11235 \n# External Service \nCRAWL4AI_BASE_URL=your_crawl4ai_base_url \n\n# Supabase Configuration\nSUPABASE_URL=your_supabase_host:port\n# Database credentials\nSUPABASE_DB=postgres\nSUPABASE_KEY=postgres\nSUPABASE_PASSWORD=postgres\n\n# OpenAI Configuration\nOPENAI_API_KEY=sk-proj-\n# Model to use for embeddings\nOPENAI_EMBEDDING_MODEL=text-embedding-3-small\n# Model to use for title and summary generation and chat analysis\nOPENAI_CONTENT_MODEL=gpt-4o-mini\n\n# Crawl Configuration\n# Set to 'url' for regular website or 'sitemap' for sitemap crawling, will crawl child pages from the sitemap\nCRAWL_TYPE=url\n# URL to crawl (can be a website URL or sitemap URL)\nCRAWL_URL=https://example.com\n# Maximum 
number of URLs to crawl from a sitemap (set to 0 for unlimited)\nMAX_URLS=30\n# Optional name for the site (if not provided, one will be generated)\nCRAWL_SITE_NAME=\n# Optional description for the site (if not provided, one will be generated)\nCRAWL_SITE_DESCRIPTION=\n\n# Chat Configuration\n# Model to use for the chat interface\nCHAT_MODEL=gpt-4o\n# Number of results to retrieve for each query\nCHAT_RESULT_LIMIT=5\n# Similarity threshold for vector search (0-1)\nCHAT_SIMILARITY_THRESHOLD=0.4\n# Default session ID (if not provided, a new one will be generated) you can use a random string\nCHAT_SESSION_ID=\n# Default user ID (optional, name, user, i.e. larry)\nCHAT_USER_ID=\n# Default chat profile (default, pydantic, technical, concise, scifi, pirate, supabase_expert, medieval, etc.)\nCHAT_PROFILE=default\n# Directory containing profile YAML files\nCHAT_PROFILES_DIR=profiles\n# Verbose mode (true, false) - enable to see more during chat\nCHAT_VERBOSE=false\n```\n\n\u003c/details\u003e\n\n\n## Running the Frontend and Backend\n\n\nTo run the backend API and the frontend UI, follow these steps:\n\n1. **Start the Backend API**:\n   Open a terminal and navigate to the root directory of the project. Then run:\n   ```bash\n   python run_api.py\n   ```\n\n2. **Start the Frontend UI**:\n   Open a separate terminal, navigate to the `frontend` directory, and run:\n   ```bash\n   npm run dev\n   ```\n\n3. **Access the Web UI**:\n   Open your web browser and go to:\n   ```\n   http://localhost:3001/\n   ```\n\nThis will start the backend API on port 8001 and the frontend dev server on port 3001 (see `frontend/vite.config.ts`).\n\n### Logging\n\nThe backend uses a **rotating application log** by default: `log/app.log` (configure `APP_LOG_DIR`, `LOG_FILE`, `LOG_LEVEL`, and rotation via `.env`; see `.env.example`). HTTP access lines (`api_http METHOD /path -\u003e status ms`), chat traces, and other services share that file. 
**Important operator actions** (for example site delete) are also written to **`log/audit.log`** by default (`AUDIT_LOG_FILE`, `AUDIT_LOG_*` rotation, `AUDIT_LOG_ENABLED=false` to disable). There is no separate `log/api/` directory. Set `API_ACCESS_LOG=false` to turn off per-request `api_http` lines; `/api/health`, `/docs`, `/redoc`, and `/openapi.json` are skipped to reduce noise. Optional per-crawl detail logs may appear under `log/crawl/` when crawls run.\n\nIf you need a complete solution (Crawl4AI with or without a local Supabase, all in Docker), see the [Docker Deployment](#docker-deployment) section of the README.\n\n ---\n\n![Image](https://github.com/user-attachments/assets/a56cf708-dfe5-4aa7-a854-0685867cee18)\n\n\u003cdetails open\u003e\n\u003csummary\u003eClick to close or open images\u003c/summary\u003e\n\n- Crawl a URL or sitemap\n![Image](https://github.com/user-attachments/assets/9155d59e-303e-484f-96de-1a7a917eeefe)\n---\n\n- Chat with your docs!\n![Image](https://github.com/user-attachments/assets/86b26bf1-15e9-4cc9-9cba-6a666c2d0646)\n---\n\n- Manage your sites\n![Image](https://github.com/user-attachments/assets/f106ebd6-96a8-448d-b871-eaa9038169f9)\n---\n\n- Search your crawled pages - view chunks\n![Image](https://github.com/user-attachments/assets/6b0defc7-d0c9-405b-bbf4-1c8045e6f206)\n---\n\n- View your sites - parent pages and chunks\n![Image](https://github.com/user-attachments/assets/4c1a0b52-953e-4dc8-b7a9-4a62ecb56c93)\n---\n\n- Get detailed info - view what's in the db, view raw, render in md or fetch the URL live!\n![Image](https://github.com/user-attachments/assets/4042e241-23ca-4009-b629-a362dac74f7a)\n---\n\n- As you chat, the AI adds preferences based on your conversation and remembers them, or you can add them manually\n![Image](https://github.com/user-attachments/assets/5836dfa8-3b09-406e-a5d9-392114d07b99)\n---\n\n- Dedicated User 
Guide\n![Image](https://github.com/user-attachments/assets/b6e00fb5-e55a-4f8a-8019-16a73ce347df)\n\n\u003c/details\u003e\n\n\n\n## Database Connection Options\n\nThe project supports two ways to connect to your Supabase database:\n\n1. **Single URL** (Option 1): Use this for both local and remote connections. The URL can be specified with or without protocol.\n   ```\n   # With protocol (for remote instances)\n   SUPABASE_URL=https://your-project.supabase.co:5432\n   \n   # Without protocol (for local instances)\n   SUPABASE_URL=192.168.xx.xx:54322\n   ```\n\nYou'll need to provide the database credentials:\n\n   ```env\n   SUPABASE_DB=postgres\n   SUPABASE_KEY=postgres\n   SUPABASE_PASSWORD=postgres\n   ```\n\n### Content Chunking for LLM Interaction\n\n\u003cdetails\u003e\n\u003csummary\u003eClick to expand\u003c/summary\u003e\nThe system automatically breaks down large content into smaller, more manageable chunks for better LLM interaction and more precise search results. This provides several benefits:\n\n1. **Improved Search Precision**: Instead of matching against entire pages, the system can find the specific chunk that best answers a query.\n\n2. **Efficient Token Usage**: When interacting with LLMs, only the relevant chunks are sent, reducing token usage and costs.\n\n3. **Better Context Management**: Each chunk maintains a reference to its parent page, preserving the full context.\n\n4. **Automatic Token Limit Handling**: Content is automatically chunked to stay within the token limits of the embedding model (8,192 tokens for text-embedding-3-small).\n\u003c/details\u003e\n\n### How Chunking Works\n\n\u003cdetails\u003e\n\u003csummary\u003eClick to expand chunking details\u003c/summary\u003e\n\nThe system uses a sophisticated semantic chunking strategy:\n\n1. 
**Semantic Boundary Detection**: Content is first split along natural semantic boundaries:\n   - Markdown headers (e.g., `# Section Title`)\n   - Paragraph breaks\n   - This preserves the meaning and context of each chunk\n\n2. **Token-Based Sizing**: Each section is then analyzed to ensure it fits within token limits:\n   - Sections that fit are kept together\n   - Sections that exceed limits are further split with token-based chunking\n   - A 200-token overlap is maintained between chunks for context continuity\n\n3. **Smart Overlap**: When creating overlaps between chunks, the system looks for natural break points:\n   - Paragraph breaks\n   - Sentence endings\n   - Clause breaks\n   - Word boundaries\n\n4. **Metadata Preservation**: Each chunk maintains references to:\n   - Its parent document\n   - Its position in the sequence (chunk index)\n   - Its token count\n\nThis approach ensures that chunks are not only sized appropriately for LLMs but also maintain semantic coherence, making them more useful for search and retrieval.\n\u003c/details\u003e\n\n### Configuring Chunking\n\n\u003cdetails\u003e\n\u003csummary\u003eClick to expand chunking configuration\u003c/summary\u003e\n\nChunk size and overlap are configured via environment variables (see `.env.example`):\n\n```env\nCHUNK_MAX_TOKENS=900\nCHUNK_OVERLAP_TOKENS=120\n```\n\nThese feed the crawler’s chunking pipeline so you can tune token usage and overlap without editing code. Values should stay within the embedding model’s context limit (for example 8,192 tokens for `text-embedding-3-small`).\n\u003c/details\u003e\n\n## Testing the Setup\n\nBefore using the crawler, you can test your setup:\n\n1. Test the database connection:\n   ```\n   python tests/test_db_connection.py\n   ```\n\n2. Test the Crawl4AI API:\n   ```\n   python tests/test_crawl_api.py\n   ```\n\n3. 
Run the Python test suite (install dev dependencies first):\n   ```bash\n   pip install -r requirements-dev.txt\n   pytest\n   ```\n\n   This includes tests for crawl API behavior, URL validation (`security_utils`), and chunking helpers.\n\n## Usage\n\n### Setting up the database\n\nBefore using the crawler, you need to set up the database:\n\n```\npython main.py setup\n```\n\nThis will create the necessary tables and extensions in your Supabase database.\n\n## Terminal\n\n\n\u003cdetails\u003e\n\u003csummary\u003eClick to expand images of terminal\u003c/summary\u003e\n\n![Image](https://github.com/user-attachments/assets/bd49c8df-4f45-4981-bd29-64fadf29a0e0)\n![Image](https://github.com/user-attachments/assets/c3071ef4-c516-43f9-8496-b49ddc59a55e)\n\n![Image](https://github.com/user-attachments/assets/7aa1d142-4a70-4d86-a17d-860ab633ae3f)\n\n\u003c/details\u003e\n\n\n### Crawling a website with args\n\n\u003cdetails\u003e\n\u003csummary\u003eClick to expand website crawling options\u003c/summary\u003e\n\nYou can crawl a website in two ways:\n\n1. Using the command-line interface:\n   ```\n   python main.py crawl https://example.com --name \"Example Site\" --description \"An example website\"\n   ```\n\n   To crawl a sitemap:\n   ```\n   python main.py crawl https://example.com/sitemap.xml --sitemap --name \"Example Site\"\n   ```\n\n   You can limit the number of URLs to crawl from the sitemap:\n   ```\n   python main.py crawl https://example.com/sitemap.xml --sitemap --max-urls 20\n   ```\n\n   Note: If you don't provide a description, the system will automatically generate one based on the content of the homepage or main page.\n\n2. 
Using the `.env` file configuration: ( recommended )\n   \n   First, update the `.env` file with your crawl settings:\n   ```\n   CRAWL_TYPE=url  # or 'sitemap' for sitemap crawling\n   CRAWL_URL=https://example.com\n   CRAWL_SITE_NAME=Example Site\n   CRAWL_SITE_DESCRIPTION=An example website  # Optional - will be auto-generated if empty\n   ```\n\n   Then run:\n   ```\n   python run_crawl.py\n   ```\n\u003c/details\u003e\n\n\n### Title and Summary Generation\n\nThe crawler automatically generates titles and summaries for crawled content using OpenAI. You can configure the model used for this in the `.env` file:\n\n```\nOPENAI_CONTENT_MODEL=gpt-4o-mini\n```\n\n#### Updating Existing Content\n\n\u003cdetails\u003e\n\u003csummary\u003eClick to expand content updating options\u003c/summary\u003e\n\nIf you have existing pages without titles or summaries, or if you want to regenerate them with a different model, you can use the `update_content.py` script:\n\n```\n# Update all sites\npython update_content.py\n\n# Update a specific site\npython update_content.py --site-id 1\n\n# Limit the number of pages to update\npython update_content.py --limit 50\n\n# Force update all pages, even if they already have titles and summaries\npython update_content.py --force\n```\n\u003c/details\u003e\n\n### Searching the crawled content\n\n\u003cdetails\u003e\n\u003csummary\u003eClick to expand search options\u003c/summary\u003e\n\nTo search the crawled content using semantic search:\n\n```\npython main.py search \"your search query\"\n```\n\nTo use text-based search instead of semantic search:\n\n```\npython main.py search \"your search query\" --text-only\n```\n\n![Image](https://github.com/user-attachments/assets/806c80ae-1fbf-4680-990e-9d2b9b3bbaa8)\n\nTo adjust the similarity threshold and limit the number of results:\n\n```\npython main.py search \"your search query\" --threshold 0.8 --limit 2\n```\n\nTo save the search results to a file:\n\n```\npython main.py search \"your 
search query\" --output results.json\n```\n\u003c/details\u003e\n\n### Listing crawled sites\n\nTo list all the sites that have been crawled:\n\n```\npython main.py list-sites\n```\n\n![Image](https://github.com/user-attachments/assets/c7fce24d-50e8-447e-8900-15ffcb56ce92)\n\nBy default, this only counts parent pages (not chunks). To include chunks in the page count:\n\n```\npython main.py list-sites --include-chunks\n```\n\n### Working with Chunks\n\n\u003cdetails\u003e\n\u003csummary\u003eClick to expand details on working with chunks\u003c/summary\u003e\n\nWhen retrieving or searching content, you can control whether chunks are included:\n\n```python\n# Get pages for a site (parent pages only)\npages = crawler.get_site_pages(site_id, limit=100)\n\n# Get pages for a site including chunks\npages_with_chunks = crawler.get_site_pages(site_id, limit=100, include_chunks=True)\n```\n\nWhen searching, chunks are automatically included and prioritized for more precise results. Each chunk includes context about its parent document:\n\n```\npython main.py search \"your search query\"\n```\n\nThe search results will include:\n- The content snippet that matched your query\n- Which document it came from\n- Which part of the document it represents (e.g., \"Part 2 of 5\")\n\nThis makes it easier to understand the context of each search result, even when it's a small chunk of a larger document.\n\u003c/details\u003e\n\n---\n### Using the chat interface in terminal\n\n![Image](https://github.com/user-attachments/assets/34d79a96-2d60-4221-a1f7-3a8582129855)\n\n\nThe project includes a chat interface in the terminal that uses an LLM to answer questions based on the crawled data. 
The chat interface now supports persistent conversation history, allowing the LLM to remember previous interactions even after restarting the application.\n\nYou can start the terminal chat interface using either the dedicated script or the main CLI:\n\n```bash\n# Using the dedicated script\npython chat.py\n\n# Using the main CLI\npython main.py chat\n```\n\n#### Chat Interface Options\n\n\u003cdetails\u003e\n\u003csummary\u003eClick to expand chat interface options\u003c/summary\u003e\n\nYou can customize the chat interface with various options:\n\n```bash\n# Specify a different OpenAI model\npython main.py chat --model gpt-4\n\n# Set the maximum number of search results to retrieve when chatting\npython main.py chat --limit 10\n\n# Adjust the similarity threshold for vector search (0-1)\npython main.py chat --threshold 0.6\n\n# Use a specific session ID for persistent conversations\npython main.py chat --session my-chat-session\n\n# Associate the conversation with a specific user\npython main.py chat --user John\n\n# Enable verbose debug output\npython main.py chat --verbose\n\n# Combined\npython main.py chat --model gpt-4 --limit 15 --threshold 0.3 --session 12123111111 --user John --verbose\n```\n\u003c/details\u003e\n\n#### Search Functionality\n\n\u003cdetails\u003e\n\u003csummary\u003eClick to expand search functionality details\u003c/summary\u003e\n\nThe chat interface uses a sophisticated hybrid search approach that combines vector similarity with text matching:\n\n1. **Vector Search**: Uses OpenAI's embeddings to find semantically similar content\n2. **Text Search**: Enhances results with keyword matching for better precision\n3. **Hybrid Approach**: Combines both methods to provide the most relevant results\n\nThis approach ensures that even when vector similarity might not find exact matches, the text search component can still retrieve relevant information. 
The system automatically adjusts the search strategy based on the query type and available content.\n\u003c/details\u003e\n\n#### Persistent Conversation History\n\n\u003cdetails\u003e\n\u003csummary\u003eClick to expand conversation history details\u003c/summary\u003e\n\nThe chat interface stores all conversation history in the database, allowing the LLM to remember previous interactions. This enables more natural and contextual conversations over time.\n\nKey features:\n- **Session-based conversations**: Each conversation gets a unique session ID\n- **User identification**: Optionally associate conversations with specific users\n- **Conversation continuity**: Continue conversations where you left off, even after restarting\n- **Chat commands**:\n  - Type `clear` to clear the conversation history\n  - Type `history` to view the conversation history\n  - Type `exit` or `bye` to quit the chat interface\n\n**Important**: To maintain the same conversation across multiple chat sessions, you must use the same session ID. The session ID is displayed when you start the chat interface. 
You can specify it before starting a new chat session:\n\n```bash\n# Start a new chat session\npython chat.py --user Joe\n# Note the session ID displayed (e.g., \"Session ID: a24b6b72-e526-4a09-b662-0f85e82f78a7\")\n\n# Later, continue the same conversation by specifying the session ID\npython chat.py --user Joe --session a24b6b72-e526-4a09-b662-0f85e82f78a7\n```\n\nYou can also set a default session ID in your `.env` file:\n\n```\nCHAT_SESSION_ID=your-session-id\n```\n\nThis way, the chat interface will always use the same session ID unless you explicitly specify a different one with the `--session` parameter.\n\u003c/details\u003e\n\n#### User Preferences and Memory\n\n\u003cdetails\u003e\n\u003csummary\u003eClick to expand user preferences and memory details\u003c/summary\u003e\n\nThe chat interface can remember user preferences and information shared during conversations, as long as you use the same session ID. For example:\n- If you tell the assistant \"I like Corvettes\" in one session\n- Then in a later session (using the same session ID), ask \"What cars do I like?\"\n- The assistant will remember and respond with \"You like Corvettes\"\n\nThis memory persistence works by:\n1. Storing all messages in the database with the session ID\n2. Analyzing conversation history when relevant questions are asked\n3. 
Extracting user preferences and information from previous messages\n\nTo get the most out of this feature, always use the same session ID and user ID when you want the assistant to remember previous conversations.\n\n#### Managing User Preferences via CLI\n\nThe chat interface includes several commands for managing user preferences directly from the command line:\n\n**Viewing Preferences**\n```\npreferences\n```\nDisplays a table of all active preferences for the current user, including ID, type, value, confidence, context, and last used timestamp.\n\n**Adding Preferences**\n```\nadd preference \u003ctype\u003e \u003cvalue\u003e [confidence]\n```\nManually adds a new preference for the current user. If confidence is not specified, it defaults to 0.9.\n\nExamples:\n```\nadd preference like Python\nadd preference expertise JavaScript 0.85\nadd preference goal \"Learn machine learning\"\n```\n\n**Deleting Preferences**\n```\ndelete preference \u003cid\u003e\n```\nDeletes a specific preference by ID.\n\n**Clearing All Preferences**\n```\nclear preferences\n```\nDeletes all preferences for the current user after confirmation.\n\n**Important**: Preference commands are only available when a user ID is provided (using `--user` when starting the chat). For more detailed information about the user preference system, see the [preferences documentation](docs/preferences.md).\n\u003c/details\u003e\n\n### Chat Profiles\n\n\u003cdetails\u003e\n\u003csummary\u003eClick to expand chat profiles details\u003c/summary\u003e\n\nThe chat interface supports different profiles that customize the behavior of the assistant. Each profile has its own system prompt, search settings, and site filtering capabilities. 
Ideally, crawl the sitemap of a documentation site, then use or create a profile whose system prompt makes the assistant an expert on those docs.\n\nBuilt-in profiles:\n- **default**: General-purpose assistant that searches all sites\n- **pydantic**: Specialized for Pydantic documentation, focusing on technical details and code examples\n- **technical**: Provides detailed technical explanations with step-by-step instructions\n- **concise**: Gives brief, to-the-point answers without unnecessary details\n\n\nYou can switch profiles during a chat session:\n```\nprofile pydantic\n```\n\nOr start with a specific profile:\n```bash\npython main.py chat --profile technical\n```\n\nYou can also view all available profiles:\n```\nprofiles\n```\n\u003c/details\u003e\n\n#### How Site Filtering Works\n\n\u003cdetails\u003e\n\u003csummary\u003eClick to expand site filtering details\u003c/summary\u003e\n\nThe `sites` array in each profile's `search_settings` controls which sites the assistant searches through when answering questions:\n\n```yaml\nsearch_settings:\n  sites: [\"pydantic\"]  # Only search in sites with \"pydantic\" in the name\n  threshold: 0.6\n  limit: 8\n```\n\nHere's how the filtering works:\n\n1. **Empty array (`sites: []`)**: Searches across ALL sites in the database\n2. **Site patterns**: Filters to only include sites where the site name contains any of the specified patterns\n3. **Pattern matching**: Uses case-insensitive partial matching, so `\"bigsk1\"` would match site names like \"Bigsk1 Com\", \"bigsk1.com\", etc.\n4. 
**Multiple patterns**: You can include multiple patterns to search across several related sites\n\nThe filtering process:\n- When a user asks a question, the system looks at the current profile's `sites` setting\n- It queries the `crawl_sites` table to find site IDs where the name contains any of the patterns\n- It then only searches for content in pages associated with those site IDs\n- This allows profiles to focus on specific content sources, making responses more relevant\n\u003c/details\u003e\n\n#### Custom Profiles\n\n\u003cdetails\u003e\n\u003csummary\u003eClick to expand custom profiles details\u003c/summary\u003e\n\nYou can create your own custom profiles by adding YAML files to the `profiles` directory. Each profile file should include:\n\n- `name`: The name of the profile (used to select it)\n- `description`: A brief description of the profile\n- `system_prompt`: The system prompt that defines the assistant's behavior\n- `search_settings`: Configuration for search behavior\n  - `sites`: List of site name patterns to filter by (empty list means search all sites)\n  - `threshold`: Similarity threshold for vector search (0-1)\n  - `limit`: Maximum number of results to return\n\nExample profile file (`profiles/custom_expert.yaml`):\n```yaml\nname: custom_expert\ndescription: Custom expert for specific documentation\nsystem_prompt: |\n  You are an expert on [specific topic].\n  \n  Your expertise includes:\n  - [Area of expertise 1]\n  - [Area of expertise 2]\n  - [Area of expertise 3]\n  \n  When answering questions:\n  - [Instruction 1]\n  - [Instruction 2]\n  - [Instruction 3]\n\nsearch_settings:\n  sites: [\"site1\", \"site2\"]  # Only search in sites containing these terms\n  threshold: 0.6  # Higher threshold for more 
precise matches\n  limit: 8  # Number of results to return\n```\n\nYou can specify a custom profiles directory:\n```bash\npython main.py chat --profiles-dir my_profiles\n```\n\u003c/details\u003e\n\n#### Configuration via .env\n\n\u003cdetails\u003e\n\u003csummary\u003eClick to expand .env configuration details\u003c/summary\u003e\n\nYou can set default values for the chat interface in your `.env` file:\n\n```\n# Chat Configuration\nCHAT_MODEL=gpt-4o\nCHAT_RESULT_LIMIT=5\nCHAT_SIMILARITY_THRESHOLD=0.5\nCHAT_SESSION_ID=default-session\nCHAT_USER_ID=default-user\nCHAT_PROFILE=default\nCHAT_PROFILES_DIR=profiles\nCHAT_VERBOSE=false\n```\n\nThis allows you to maintain consistent settings and continue the same conversation across multiple sessions.\n\u003c/details\u003e\n\n### Resetting the database\n\n\u003cdetails\u003e\n\u003csummary\u003eClick to expand Resetting the database\u003c/summary\u003e\nIf you want to start fresh and delete all data or recreate the tables, you can use the `reset_database.py` script:\n\n```\npython tests/reset_database.py\n```\n\nThis script provides two options:\n1. Delete all data (keep tables) - This will delete all data from the tables but keep the table structure.\n2. Drop and recreate tables - This will drop the tables and recreate them, effectively starting from scratch.\n\u003c/details\u003e\n\n### Programmatic usage\n\nYou can also use the crawler programmatically in your own Python code. 
See `tests/example.py` for a demonstration.\n\n## Project Structure\n\n\u003cdetails\u003e\n\u003csummary\u003eClick to expand Project Structure\u003c/summary\u003e\n\n### Backend\n\n- `main.py`: Main script with command-line interface\n- `crawler.py`: Main crawler class that ties everything together\n- `crawl_client.py`: Client for interacting with the Crawl4AI API\n- `embeddings.py`: Module for generating OpenAI embeddings\n- `content_enhancer.py`: Module for generating titles and summaries using OpenAI\n- `db_client.py`: Client for interacting with the Supabase database\n- `db_setup.py`: Script for setting up the database\n- `chat.py`: Chat interface for interacting with crawled data using an LLM\n- `run_api.py`: Script to run the API\n- `run_crawl.py`: Script to run a crawl using the configuration from the `.env` file\n- `update_content.py`: Script to update existing pages with titles and summaries\n- `utils.py`: Utility functions for the CLI\n- `requirements.txt`: List of dependencies for the backend\n- `.env.example`: Example environment file for the backend\n- `api/`: Directory containing the FastAPI implementation\n  - `main.py`: FastAPI application entry point (optional API key dependency, CORS, auto-refresh hooks)\n  - `auth.py`: Optional `SCC_API_KEYS` / `API_KEYS` validation\n  - `routers/`: Directory containing API route definitions\n    - `crawl.py`: Endpoints for crawling websites and sitemaps\n    - `search.py`: Endpoints for searching crawled content\n    - `sites.py`: Endpoints for managing and retrieving site information\n    - `chat.py`: Endpoints for interacting with the chat interface\n    - `pages.py`: Endpoints for managing and retrieving page information\n  - `README.md`: Comprehensive API documentation\n- `security_utils.py`: URL validation for crawl/fetch targets and shared env parsing helpers\n- `requirements-dev.txt`: Dev dependencies (e.g. 
`pytest`) layered on `requirements.txt`\n- `docker/`: Directory containing Docker-related files\n  - `Dockerfile`: Docker image definition for the backend application\n  - `frontend.Dockerfile`: Docker image definition for the frontend application\n  - `docker-compose.yml`: Docker Compose configuration for the API service only\n  - `crawl4ai-docker-compose.yml`: Docker Compose configuration for integrated API and Crawl4AI services\n  - `full-stack-compose.yml`: Docker Compose configuration for the complete stack (API, Crawl4AI, Supabase, Frontend)\n  - `setup.sh`: Script to set up the full stack environment\n  - `reset.sh`: Script to reset the full stack environment\n  - `status.sh`: Script to check the status of the full stack environment\n  - `.env`: Environment variables for Docker deployment\n  - `.env.example`: Example environment file for Docker deployment\n  - `full-stack/`: Documentation and utilities for the full stack setup\n    - `README.md`: Documentation for the full stack setup\n    - `ENV_GUIDE.md`: Guide for configuring environment variables\n    - `check_db_connections.sh`: Script to verify database connections\n  - `volumes/`: Directory for Docker volumes\n  - `.dockerignore`: Specifies files to exclude from Docker builds\n- `supabase_explorer/`: Directory containing the Supabase Explorer Streamlit app\n  - `supabase_explorer.py`: Interactive Streamlit app for database exploration\n  - `supabase_queries.md`: Collection of useful SQL queries\n  - `database_explorer_readme.md`: Documentation for the Supabase Explorer\n- `profiles/`: Directory containing chat profile configurations\n  - Various YAML files defining different chat personalities and behaviors\n- `tests/`: Directory containing test scripts\n  - `example.py`: Example script demonstrating programmatic usage\n  - `test_db_connection.py`: Script to test the database connection\n  - `test_crawl_api.py`: Script to test the Crawl4AI API\n  - `test_security_utils.py`, `test_crawler_chunking.py`: 
Pytest coverage for URL rules and chunking\n  - `smoke_api.py`: Optional smoke checks against a running API\n  - `reset_database.py`: Script to delete tables or reset the database\n\n### Frontend\n\n- `frontend/`: Directory containing the React-based web UI\n  - `src/`: Source code for the frontend application\n    - `api/`: API client for communicating with the backend\n      - `apiService.ts`: Service for making API requests\n      - `apiWrapper.ts`: Wrapper for API endpoints with type definitions\n    - `components/`: Reusable UI components\n      - `Layout.tsx`: Main layout component with Sidebar and Navbar\n      - `Navbar.tsx`: Top navigation bar\n      - `Sidebar.tsx`: Side navigation menu\n      - `NotificationCenter.tsx`: Notification system for user alerts\n      - `PageListItem.tsx`: Component for displaying page items in lists\n      - `UserProfileModal.tsx`: Modal for user profile management\n      - `ui/`: Shadcn UI component library\n        - Various UI components like buttons, inputs, dialogs, etc.\n    - `context/`: React context providers for state management\n    - `hooks/`: Custom React hooks\n    - `lib/`: Utility libraries and configurations\n    - `pages/`: Main application views\n      - `HomePage.tsx`: Landing page\n      - `ChatPage.tsx`: AI chat interface\n      - `CrawlPage.tsx`: Web crawling interface\n      - `SearchPage.tsx`: Search interface\n      - `SitesPage.tsx`: Site management\n      - `SiteDetailPage.tsx`: Detailed view of a crawled site\n      - `NotificationInfo.tsx`: Notification settings and information\n      - `UserProfileModal.tsx`: User profile management\n      - `UserPreferencesPage.tsx`: User preferences management\n    - `styles/`: CSS and styling files\n    - `utils/`: Utility functions\n    - `App.tsx`: Main application component\n    - `main.tsx`: Entry point for the React application\n  - `public/`: Static assets\n  - `index.html`: HTML entry point\n  - `vite.config.ts`: Vite configuration\n  - 
`tailwind.config.js`: Tailwind CSS configuration\n  - `tsconfig.json`: TypeScript configuration\n  - `package.json`: NPM dependencies and scripts\n\n---\n\u003c/details\u003e\n\n## Database Structure\n\n\u003cdetails\u003e\n\u003csummary\u003eClick to expand Database Structure\u003c/summary\u003e\n\n![Image](https://github.com/user-attachments/assets/629345d4-3dea-489b-be0e-65cb07f53d9a)\n\n\nThe project uses the following tables in the Supabase database:\n\n1. `crawl_sites`: Stores information about the sites you've crawled\n   - `id`: Primary key\n   - `name`: Name of the site\n   - `url`: URL of the site\n   - `description`: Optional description of the site\n   - `created_at`: Timestamp when the site was added\n\n2. `crawl_pages`: Stores the actual content, embeddings, titles, and summaries for each page\n   - `id`: Primary key\n   - `site_id`: Foreign key referencing the `crawl_sites` table\n   - `url`: URL of the page (unique)\n   - `title`: Title of the page\n   - `content`: Content of the page\n   - `summary`: Summary of the page\n   - `embedding`: Vector embedding of the content\n   - `metadata`: Additional metadata about the page\n   - `is_chunk`: Boolean indicating if this is a chunk of a larger page\n   - `chunk_index`: Index of the chunk within the parent page\n   - `parent_id`: Foreign key referencing the parent page\n   - `created_at`: Timestamp when the page was added\n   - `updated_at`: Timestamp when the page was last updated\n\n3. 
`chat_conversations`: Stores conversation history for the chat interface\n   - `id`: Primary key\n   - `session_id`: Unique identifier for the conversation session\n   - `user_id`: Optional identifier for the user\n   - `timestamp`: Timestamp when the message was sent\n   - `role`: Role of the message sender (user, assistant, system)\n   - `content`: Content of the message\n   - `metadata`: Additional metadata about the message\n\nWhen you crawl a site multiple times, the system will update existing pages rather than creating duplicates, ensuring you always have the most recent content. Similarly, the chat interface will maintain conversation history across sessions, allowing for more natural and contextual interactions.\n\u003c/details\u003e\n\n## Supabase Explorer\n\n![Image](https://github.com/user-attachments/assets/26e7681b-7835-4bc3-9d64-08f2d314f77f)\n\n\n![Image](https://github.com/user-attachments/assets/85388083-a1d0-4054-8913-e67c6c1fb90e)\n\nThe project includes a powerful Streamlit-based Supabase Explorer app that allows you to interactively explore and analyze your database. 
This tool makes it easy to run SQL queries, visualize results, and gain insights from your crawled data.\n\n### Features\n\n- **Interactive Query Interface**: Run predefined or custom SQL queries with a single click\n- **Data Visualization**: Create bar charts, line charts, and pie charts from your query results\n- **Database Overview**: View statistics about your database, including site counts and page distribution\n- **Export Functionality**: Download query results as CSV files for further analysis\n- **Predefined Queries**: Access a comprehensive collection of useful SQL queries organized by category:\n  - Site queries\n  - Page queries\n  - Chunk queries\n  - Metadata queries\n  - Conversation history queries\n  - Statistics queries\n  - Embedding analysis queries\n  - Content quality queries\n  - Advanced conversation analysis\n  - Performance queries\n  - Search performance analysis\n\n### Running the Supabase Explorer\n\nTo launch the Supabase Explorer:\n\n```bash\ncd supabase_explorer\npip install -r requirements.txt\nstreamlit run supabase_explorer.py\n```\n\nThe app will automatically connect to your Supabase database using the credentials in your root `.env` file.\n\n#### Running Supabase Explorer in Docker\n\nThe Supabase Explorer is also available as part of the Docker setup. 
When you run either of the Docker Compose configurations, the Streamlit app will be accessible at:\n\n```\nhttp://localhost:8501\n```\n\nThis allows you to explore your database directly from the Docker container without having to install Streamlit locally.\n\n```bash\n# Start the Docker containers including the Supabase Explorer\ndocker-compose -f docker/docker-compose.yml up -d\n\n# Or with the integrated Crawl4AI setup\ndocker-compose -f docker/crawl4ai-docker-compose.yml up -d\n```\n\n### Adding Custom Queries\n\n\u003cdetails\u003e\n\u003csummary\u003eClick to expand Adding Custom Queries\u003c/summary\u003e\n\nYou can add your own custom queries to the predefined list by editing the `supabase_explorer/supabase_queries.md` file. Follow the existing format:\n\n```markdown\nYour Category\n\nYour Query Name\n\n```sql\nSELECT * FROM your_table WHERE your_condition;\n```\n\n\nAfter adding your queries, restart the Streamlit app to load the new queries.\n\u003c/details\u003e\n\n\n\n## Docker Deployment - 3 different options! \n\n- Use the docker compose file to start the app\n![Image](https://github.com/user-attachments/assets/a56cf708-dfe5-4aa7-a854-0685867cee18)\n\n\n- If terminal is your thing then you can also just exec into the container\n![Image](https://github.com/user-attachments/assets/be41b857-47ca-4804-97e2-98764a270748)\n\n\n\n```bash\n# Build and start the container\ndocker-compose -f docker/docker-compose.yml up -d\n\n# View logs\ndocker-compose -f docker/docker-compose.yml logs -f\n```\n\nIf the Crawl page or other `/api` calls return **401** in Docker but work with `npm run dev` + Python on the host, the API is seeing the **frontend container’s IP** on the Docker bridge (not `127.0.0.1`). The compose file sets **`SUPA_API_TRUST_CIDRS`** for common private ranges so **`SUPA_API_AUTH` / API keys** still allow browser traffic through the Vite proxy. 
Override **`SUPA_API_TRUST_CIDRS`** in `.env` if your bridge uses a different range, and include **`172.16.0.0/12`** unless you terminate TLS and use **`SUPA_API_TRUST_FORWARDED`** with a trusted proxy.\n\nThis setup includes:\n- API backend on port 8001\n- Frontend UI on port 3001\n- Streamlit Explorer on port 8501\n\n---\n\n### Integrated Crawl4AI Docker Deployment\n\nIf you already have Supabase running locally or externally and want to run both the API and Crawl4AI in Docker containers, use the provided `crawl4ai-docker-compose.yml` file:\n\n```bash\n# Build and start both containers\ndocker-compose -f docker/crawl4ai-docker-compose.yml up -d\n\n# View logs\ndocker-compose -f docker/crawl4ai-docker-compose.yml logs -f\n```\n\nThis setup will:\n1. Start a Crawl4AI container using the official image from Docker Hub\n2. Start your API container with the correct configuration to connect to Crawl4AI\n3. Start the frontend UI container for the web interface\n4. Start the Streamlit Explorer for database exploration\n5. 
Create a network for the containers to communicate with each other\n\nMake sure the `.env` file in the project root includes the necessary Crawl4AI configuration:\n\n```env\n# Crawl4AI Configuration\nCRAWL4AI_API_TOKEN=your_crawl4ai_api_token\n# This will be automatically set to the Docker service name in the container\n# CRAWL4AI_BASE_URL=http://crawl4ai:11235\n```\n\nAccess the services:\n- API: http://localhost:8001\n- Frontend UI: http://localhost:3000\n- Streamlit Explorer: http://localhost:8501\n- Crawl4AI: http://localhost:11235\n\n---\n\n### Full Stack Docker Setup (Supabase + API + Crawl4AI + Frontend)\n\nWe provide a comprehensive Docker setup that runs the entire application stack:\n\n- Supa Chat API Backend\n- Frontend UI\n- Supabase Docker images (Database, Kong, Realtime, etc.)\n- Crawl4AI Docker image for web crawling\n\nWith this setup, the complete application runs without any external dependencies.\n\n\u003cdetails\u003e\n\u003csummary\u003eClick to expand full stack setup details\u003c/summary\u003e\n\n#### Important Environment Variable Configuration\n\nThe full-stack Docker setup requires careful configuration of environment variables:\n\n1. **SUPABASE_URL**: This should be commented out or left empty to ensure the API connects directly to the database:\n   ```\n   # SUPABASE_URL=http://kong:8002\n   ```\n   \n   If this is set, the API will try to connect to Kong for database operations, which will cause SSL negotiation errors.\n\n2. **Direct Database Connection**: Ensure these database connection parameters are set correctly:\n   ```\n   SUPABASE_HOST=db\n   SUPABASE_PORT=5432\n   SUPABASE_KEY=supabase_admin\n   SUPABASE_PASSWORD=${POSTGRES_PASSWORD}\n   ```\n\n#### Setting Up the Full Stack\n\nTo use the full stack Docker setup:\n\n1. Navigate to the docker directory:\n   ```bash\n   cd docker\n   ```\n\n2. 
Run the setup script to create necessary configuration files:\n   ```bash\n   chmod +x setup_update.sh\n   ./setup_update.sh\n   ```\n   \n   This script will:\n   - Check for the existence of the `.env` file\n   - Create SQL scripts for database initialization\n   - Download Supabase initialization scripts\n   - Create application tables and functions\n   - Generate the Kong configuration file\n\n3. Edit the Docker-specific `.env` file with your actual values:\n   ```bash\n   nano .env\n   ```\n\n4. Start the services:\n   ```bash\n   docker-compose -f full-stack-compose.yml up -d\n   ```\n\n5. Access the services:\n   - API: http://localhost:8001\n   - API Documentation: http://localhost:8001/docs\n   - Frontend UI: http://localhost:3000\n   - Supabase Studio: http://localhost:3001 (username: supabase, password: from your .env file)\n   - Kong API Gateway: http://localhost:8002\n   - Crawl4AI: http://localhost:11235\n\n6. Monitor or manage the stack:\n   ```bash\n   # Check status of all services\n   ./status.sh\n   \n   # Reset the stack (removes all data)\n   ./reset.sh\n   ```\n\n#### Troubleshooting\n\n1. **Database Connection Issues**:\n   - If you see SSL negotiation errors, make sure `SUPABASE_URL` is commented out or empty in your `.env` file\n   - Verify the database credentials in the `.env` file\n   - Restart the API service after making changes:\n     ```bash\n     docker-compose -f full-stack-compose.yml restart api\n     ```\n\n2. **REST Service Issues**:\n   - If the REST service is not connecting properly, run the fix script:\n     ```bash\n     ./fix_rest.sh\n     ```\n\n3. 
**Checking Logs**:\n   - View logs for a specific service:\n     ```bash\n     docker logs supachat-api\n     docker logs supachat-kong\n     docker logs supachat-frontend\n     ```\n\n\u003c/details\u003e\n\nFor more detailed instructions, see the [Docker README](docker/full-stack/README.md), [System Flows Documentation](docs/SYSTEM_FLOWS.md), and the **[HTTP API guide](docs/API.md)**.\n\n## API\n\nThe project includes a FastAPI-based REST API that allows you to integrate the Supa-Crawl-Chat functionality with other applications or build custom frontends. The API provides endpoints for searching, crawling, managing sites, pages, and chatting.\n\n**Full reference:** [docs/API.md](docs/API.md) (auth headers, env vars, rate limits, public routes, curl examples). **Index of all docs:** [docs/README.md](docs/README.md). API traffic is logged in the same file as the rest of the app (see [Logging](#logging) above), not under a separate `log/api/` path.\n\n### API security (optional)\n\nAuthentication is enforced when **`SUPA_API_AUTH`** is enabled (with **`SUPA_API_KEY`**) and/or legacy **`SCC_API_KEYS`** / **`API_KEYS`** is set. Non-trusted clients must send:\n\n- Header `x-api-key: \u003csecret\u003e`, or  \n- Header `Authorization: Bearer \u003csecret\u003e`\n\n(`x-api-key` wins if both are sent.) Localhost and optional trusted CIDRs may bypass keys; **`GET /api/health`** and **`/api/auth/webui/*`** stay public. WebUI password protection uses **`WEBUI_PASSWORD`** and JWTs from **`POST /api/auth/webui/login`**. See [docs/API.md](docs/API.md).\n\nFor the React UI, set `VITE_API_KEY` in `frontend/.env` when using legacy keys (see `frontend/.env.example`). 
For browser clients, set `API_CORS_ORIGINS` to your frontend origins (comma-separated); if unset, the API uses permissive CORS for local development.\n\nCrawl targets are validated to reduce SSRF risk: only public `http`/`https` URLs are allowed unless you set `ALLOW_PRIVATE_CRAWL_URLS` or list hosts in `CRAWL_ALLOWED_HOSTS`. See `.env.example` for details.\n\n### Running the API\n\nTo start the API server:\n\n```bash\npython run_api.py\n```\n\nor use:\n\n```bash\n# Run from the project root so the `api.main:app` import path resolves\nuvicorn api.main:app --host 0.0.0.0 --port 8001 --reload\n```\n\nThe API will be available at `http://localhost:8001`.\n\n### API Endpoints\n\nThe interactive API documentation is available at:\n\n```\nhttp://localhost:8001/docs\n```\n\nThe API provides the following endpoints:\n\n\u003cdetails\u003e\n\u003csummary\u003eClick to expand API endpoints\u003c/summary\u003e\n\n#### Search\n\n- `GET /api/search`: Search for content using semantic search or text search\n  - Parameters:\n    - `query`: The search query\n    - `threshold`: Similarity threshold (0-1)\n    - `limit`: Maximum number of results\n    - `text_only`: Use text search instead of embeddings\n    - `site_id`: Optional site ID to filter results by\n\n#### Crawl\n\n- `POST /api/crawl`: Crawl a website or sitemap\n  - Body:\n    - `url`: URL to crawl\n    - `site_name`: Optional name for the site\n    - `site_description`: Optional description of the site\n    - `is_sitemap`: Whether the URL is a sitemap\n    - `max_urls`: Maximum number of URLs to crawl from a sitemap\n\n- `GET /api/crawl/status/{site_id}`: Get the status of a crawl by site ID\n\n#### Sites\n\n- `GET /api/sites`: List all crawled sites\n  - Parameters:\n    - `include_chunks`: Whether to include chunks in the page count\n\n- `GET /api/sites/{site_id}`: Get a site by ID\n  - Parameters:\n    - `include_chunks`: Whether to include chunks in the page count\n\n- `GET /api/sites/{site_id}/pages`: Get pages for a specific site\n  - Parameters:\n    - `include_chunks`: Whether 
to include chunks in the results\n    - `limit`: Maximum number of pages to return\n\n#### Chat\n\n- `POST /api/chat`: Send a message to the chat bot and get a response\n  - Body:\n    - `message`: The user's message\n    - `session_id`: Optional session ID for persistent conversations\n    - `user_id`: Optional user ID\n    - `profile`: Optional profile to use\n  - Parameters:\n    - `model`: Optional model to use\n    - `result_limit`: Optional maximum number of search results\n    - `similarity_threshold`: Optional similarity threshold (0-1)\n    - `include_context`: Whether to include search context in the response\n    - `include_history`: Whether to include conversation history in the response\n\n- `GET /api/chat/profiles`: List all available profiles\n  - Parameters:\n    - `session_id`: Optional session ID to get active profile\n    - `user_id`: Optional user ID\n\n- `POST /api/chat/profiles/{profile_name}`: Set the active profile for a session\n  - Parameters:\n    - `session_id`: Session ID\n    - `user_id`: Optional user ID\n\n- `GET /api/chat/history`: Get conversation history for a session\n  - Parameters:\n    - `session_id`: Session ID\n    - `user_id`: Optional user ID\n\n- `DELETE /api/chat/history`: Clear conversation history for a session\n  - Parameters:\n    - `session_id`: Session ID\n    - `user_id`: Optional user ID\n\u003c/details\u003e\n\n### Example API Usage\n\nHere's an example of how to use the API with curl:\n\n```bash\n# Search for content (add -H \"x-api-key: YOUR_KEY\" if SCC_API_KEYS/API_KEYS is configured)\ncurl -X GET \"http://localhost:8001/api/search?query=pydantic\u0026threshold=0.3\u0026limit=5\" -H \"accept: application/json\"\n\n# Start a chat session\ncurl -X POST \"http://localhost:8001/api/chat\" \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\"message\": \"Tell me about pydantic\", \"user_id\": \"example_user\"}'\n\n# Continue the conversation with the same session ID\ncurl -X POST 
\"http://localhost:8001/api/chat\" \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\"message\": \"How do I use BaseModel?\", \"session_id\": \"SESSION_ID_FROM_PREVIOUS_RESPONSE\", \"user_id\": \"example_user\"}'\n```\nFinished crawl example \n\n![Image](https://github.com/user-attachments/assets/ee12f7f1-1347-4968-9e2b-6466ac835b40)\n\n\n\n\n## License\n\nThis project is licensed under the MIT License - see the LICENSE file for details.","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbigsk1%2Fsupa-crawl-chat","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbigsk1%2Fsupa-crawl-chat","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbigsk1%2Fsupa-crawl-chat/lists"}