{"id":29985144,"url":"https://github.com/rootsongjc/rag-chatbot","last_synced_at":"2026-04-28T21:33:54.030Z","repository":{"id":308166670,"uuid":"1031835662","full_name":"rootsongjc/rag-chatbot","owner":"rootsongjc","description":"Build an embeddable RAG Chatbot for your website using Cloudflare Workers.","archived":false,"fork":false,"pushed_at":"2025-08-19T04:29:29.000Z","size":84,"stargazers_count":54,"open_issues_count":2,"forks_count":10,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-03-28T13:35:52.877Z","etag":null,"topics":["ai","chatbot","cloudflare","cloudflare-workers","hugo","rag","website"],"latest_commit_sha":null,"homepage":"https://jimmysong.io/book/rag-handbook/","language":"TypeScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/rootsongjc.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-08-04T12:02:40.000Z","updated_at":"2026-02-28T13:38:27.000Z","dependencies_parsed_at":"2025-08-04T17:04:34.238Z","dependency_job_id":"9f5c7630-e798-47d2-8c12-a427629a6a29","html_url":"https://github.com/rootsongjc/rag-chatbot","commit_stats":null,"previous_names":["rootsongjc/rag-chatbot"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/rootsongjc/rag-chatbot","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rootsongjc%2Frag-chatbot","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rootsongjc%2Frag-chatbot/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rootsongjc%2Frag-c
hatbot/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rootsongjc%2Frag-chatbot/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/rootsongjc","download_url":"https://codeload.github.com/rootsongjc/rag-chatbot/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rootsongjc%2Frag-chatbot/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32400865,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-28T19:38:08.556Z","status":"ssl_error","status_checked_at":"2026-04-28T19:37:55.688Z","response_time":56,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai","chatbot","cloudflare","cloudflare-workers","hugo","rag","website"],"created_at":"2025-08-04T22:01:13.481Z","updated_at":"2026-04-28T21:33:54.022Z","avatar_url":"https://github.com/rootsongjc.png","language":"TypeScript","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Website RAG Chatbot\n\nBuild an embeddable RAG Chatbot for your website using Cloudflare Workers. 
The JavaScript widget is stored locally for easy maintenance.\nData source: The `content/` directory (Markdown) of your Hugo website repository `website`.\nModel backend: Switchable between Gemini and Qwen (Tongyi Qianwen).\n\n## Features\n\n- Markdown -\u003e Plain text -\u003e Chunking -\u003e Embedding -\u003e Write to Vectorize\n- /chat: Retrieve Top-K chunks + assemble prompt + call LLM to generate Chinese answers\n- Returns source references (source, url)\n- Embeddable frontend `widget.js` for your Hugo site\n\n## Directory Structure\n\nSee the repository tree (`src/`, `scripts/`).\n\n## Prerequisites\n\n1. **Cloudflare account** + `wrangler` installed\n2. **Create a Vectorize index** (dimension consistent with `wrangler.toml`, default 1024), and bind in `wrangler.toml`:\n\n   ```toml\n   [[vectorize]]\n   binding = \"VECTORIZE\"\n   index_name = \"website-rag\"\n   ```\n\n3. **Set Secrets / Vars**\n\n   ```bash\n   wrangler secret put ADMIN_TOKEN # For Vectorize admin API\n   wrangler secret put GOOGLE_API_KEY     # If PROVIDER=gemini\n   wrangler secret put QWEN_API_KEY       # If PROVIDER=qwen\n   # (Optional) wrangler secret put QWEN_BASE\n   # (Optional) wrangler secret put QWEN_EMBED_MODEL\n   ```\n\n   And set in `[vars]` of `wrangler.toml`: `PROVIDER`, `EMBED_DIM`, `LLM_MODEL`.\n\n## Development \u0026 Deployment\n\n1. Install dependencies:\n\n   ```bash\n   npm i\n   ```\n\n2. Local development (Cloudflare login required):\n\n   ```bash\n   npm run dev\n   ```\n\n3. Deploy to Cloudflare:\n\n   ```bash\n   npm run deploy\n   ```\n\n4. 
Save `widget.js` locally and ensure your site references it:\n   - Reference the local `widget.js` path in your HTML.\n\n## Ingest Your Hugo Content\n\nRun locally (Node 20+):\n\n```bash\n# Example:\nPROVIDER=gemini \\\nGOOGLE_API_KEY=your_google_api_key \\\nADMIN_TOKEN=your_admin_token \\\nWORKER_URL=https://\u003cyour-worker\u003e.workers.dev \\\nCONTENT_DIR=../website/content \\\nBASE_URL=https://your-site.com \\\nEMBED_DIM=1024 \\\nnpm run ingest\n```\n\nOr switch to Qwen:\n\n```bash\nPROVIDER=qwen \\\nQWEN_API_KEY=your_qwen_api_key \\\nADMIN_TOKEN=your_admin_token \\\nWORKER_URL=https://\u003cyour-worker\u003e.workers.dev \\\nCONTENT_DIR=../website/content \\\nBASE_URL=https://your-site.com \\\nEMBED_DIM=1024 \\\nnpm run ingest\n```\n\n\u003e Tip: Ensure the Vectorize index dimension (e.g., 1024) matches the embedding dimension.\n\n## Embed in Your Website\n\nIn your Hugo template (e.g., `layouts/partials/footer.html`), add the following, referencing your local JavaScript path:\n\n```html\n\u003cscript\n  src=\"/path/to/your/local/widget.js\"\n  data-endpoint=\"https://\u003cyour-worker\u003e.workers.dev\"\n  defer\n\u003e\u003c/script\u003e\n```\n\nThis will display the chat widget in the bottom right corner of your site.\n\n## Customization \u0026 Improvements\n\n- **Rerank**: Call a rerank model on retrieval results to improve relevance.\n- **Chunking strategy**: Optimize chunk length based on Chinese punctuation and headings.\n- **Source links**: The mapping in `scripts/ingest.ts`'s `toUrlFromPath` can be further refined with Hugo routing rules.\n- **Conversation memory**: Integrate KV / D1 to store user chat history for summarization and compression.\n\n## Detailed Operation Guide\n\n### Full Reindex\n\nTo completely rebuild the vector index, follow these steps:\n\n1. 
**Clear the index**:\n\n   ```bash\n   # Use the admin API to clear all vector data\n   curl -X DELETE \"https://\u003cyour-worker-url\u003e/admin/clear-all\" \\\n     -H \"Authorization: Bearer $ADMIN_TOKEN\"\n   ```\n\n2. **Perform full reindex**:\n\n   ```bash\n   npm run full-reindex\n   ```\n\n   This command will automatically clear the database and reindex all content.\n\n3. **ADMIN_TOKEN Permission Notes**:\n   - `ADMIN_TOKEN` authorizes admin operations (clear DB, batch upload, etc.)\n   - Store securely, usually as an environment variable or Cloudflare Secret\n   - Has full DB read/write permissions—keep it safe\n\n### First-Time Initialization\n\nFor first deployment, complete these steps:\n\n1. **Create Vectorize index**:\n\n   ```bash\n   # Create vector index in Cloudflare console\n   # Or use wrangler command (if supported)\n   wrangler vectorize create website-rag --dimensions=1024\n   ```\n\n2. **Configure embedding dimension**:\n   In `wrangler.toml`:\n\n   ```toml\n   [vars]\n   EMBED_DIM = 1024  # Must match Vectorize index dimension\n   PROVIDER = \"qwen\"  # Or \"gemini\"\n   ```\n\n3. **Initial run of indexing**:\n\n   ```bash\n   # Run after configuring all required env variables\n   PROVIDER=qwen \\\n   QWEN_API_KEY=your_qwen_api_key \\\n   ADMIN_TOKEN=your_admin_token \\\n   WORKER_URL=https://\u003cyour-worker\u003e.workers.dev \\\n   CONTENT_DIR=../website/content \\\n   EMBED_DIM=1024 \\\n   npm run ingest\n   ```\n\n### Bilingual Blog Extraction\n\nSupports extracting and updating new Chinese/English bilingual blog content:\n\n1. 
**Extract new blogs**: Use `manual-ingest.ts` to extract new bilingual blogs, ensuring the vector DB contains the latest content.\n\n   ```bash\n   # Extract a single Chinese blog\n   npm run manual-ingest ../../content/zh/blog/new-post/index.md\n\n   # Extract a single English blog\n   npm run manual-ingest ../../content/en/blog/new-post/index.md\n\n   # Extract both Chinese and English versions\n   npm run manual-ingest ../../content/zh/blog/new-post/index.md ../../content/en/blog/new-post/index.md\n   ```\n\n2. **Update title dictionary**: After adding new bilingual blogs, regenerate the title mapping file to support title translation.\n\n   ```bash\n   npm run generate-titles\n   ```\n\n#### Bilingual Blog Extraction Workflow\n\nOn initialization, the system processes bilingual blogs as follows:\n\n1. **Deduplication strategy**:\n   - Scans all blogs under `content/zh/blog/` and `content/en/blog/`\n   - For blogs with both Chinese and English versions, **Chinese version is prioritized** for vectorization\n   - Only extracts English version if Chinese is absent\n   - Each vector entry includes a `language` metadata field for language filtering during retrieval\n\n2. **Title mapping file generation**:\n   - `generate-title-dictionary.ts` scans all blogs with both Chinese and English versions\n   - Extracts `title` or `Title` from frontmatter\n   - Generates a mapping from Chinese to English titles\n   - Saves as both JSON and TypeScript:\n     - `src/rag/title-dictionary.json`\n     - `src/rag/title-dictionary.ts`\n\n3. **Recommended bilingual blog update workflow**:\n\n   For new bilingual blogs, follow this order:\n\n   ```bash\n   # Step 1: Extract vector data\n   npm run manual-ingest ../../content/zh/blog/new-post/index.md ../../content/en/blog/new-post/index.md\n   # Step 2: Update title dictionary\n   npm run generate-title-dict\n   ```\n\n   This ensures the vector DB has the latest content and title translation works properly.\n\n4. 
**Language retrieval mechanism**:\n   - On user query, the system filters by current page language (zh/en)\n   - Returns content in the corresponding language first; falls back to all languages if not found\n   - Supports auto-detecting language by URL path (`/en/` for English, others for Chinese)\n\n### Single File Upload\n\nTo update the index with one or a few files:\n\n1. **Upload a single file with the script**:\n\n   ```bash\n   # Upload a specific Markdown file\n   npm run manual-ingest ../website/content/blog/new-post.md\n\n   # Upload multiple files\n   npm run manual-ingest file1.md file2.md file3.md\n   ```\n\n2. **Upload directly via API** (advanced):\n\n   ```bash\n   # Call the Worker's admin API directly\n   curl -X POST \"https://\u003cyour-worker-url\u003e/admin/upsert\" \\\n     -H \"Authorization: Bearer $ADMIN_TOKEN\" \\\n     -H \"Content-Type: application/json\" \\\n     -d '{\n       \"items\": [{\n         \"id\": \"doc-1\",\n         \"vector\": [0.1, 0.2, ...],\n         \"text\": \"Document content\",\n         \"source\": \"blog/example.md\",\n         \"title\": \"Sample Article\",\n         \"url\": \"https://your-site.com/blog/example/\"\n       }]\n     }'\n   ```\n\n### Cloudflare Configuration Details\n\nFull Cloudflare environment setup steps:\n\n1. **Wrangler authentication**:\n\n   ```bash\n   # Log in to Cloudflare\n   wrangler login\n\n   # Verify login status\n   wrangler whoami\n   ```\n\n2. 
**Configure wrangler.toml**:\n\n   ```toml\n   name = \"website-rag-worker\"\n   main = \"src/worker.ts\"\n   compatibility_date = \"2024-07-01\"\n\n   # Environment variables\n   [vars]\n   PROVIDER = \"qwen\"                    # Model provider: gemini or qwen\n   EMBED_DIM = 1024                     # Embedding dimension, must match index\n   LLM_MODEL = \"qwen-turbo-latest\"      # LLM model name\n   QWEN_EMBED_MODEL = \"text-embedding-v4\"  # Qwen embedding model\n\n   # Vectorize binding\n   [[vectorize]]\n   binding = \"VECTORIZE\"\n   index_name = \"website-rag\"           # Index name\n\n   # Optional: KV storage binding (for chat memory, etc.)\n   # [[kv_namespaces]]\n   # binding = \"CHAT_HISTORY\"\n   # id = \"your-kv-namespace-id\"\n   ```\n\n3. **Set environment variables and secrets**:\n\n   ```bash\n   # Required secrets\n   wrangler secret put ADMIN_TOKEN      # Admin token\n\n   # Set according to PROVIDER\n   wrangler secret put GOOGLE_API_KEY   # Gemini API key\n   wrangler secret put QWEN_API_KEY     # Qwen API key\n\n   # Optional\n   wrangler secret put QWEN_BASE        # Custom Qwen API endpoint\n   ```\n\n4. **Deploy and test**:\n\n   ```bash\n   # Build project\n   npm run build\n\n   # Local test\n   npm run dev\n\n   # Deploy to production\n   npm run deploy\n   ```\n\n5. **Billing notes**:\n   - **Vectorize**: Billed by vector count and query times\n   - **Workers**: Billed by request count and CPU time\n   - **KV** (if used): Billed by storage and operation count\n   - Monitor usage and set appropriate limits\n   - Free tier available for development/testing\n\n### Troubleshooting\n\n1. **Embedding dimension mismatch**:\n   Ensure `EMBED_DIM` matches the Vectorize index dimension\n\n2. **API quota exceeded**:\n   Adjust `MAX_CONCURRENT_EMBEDDINGS` and batch size parameters\n\n3. **Permission errors**:\n   Check if `ADMIN_TOKEN` is set correctly and not expired\n\n4. 
**Network proxy**:\n   Set `https_proxy` environment variable if needed\n\n## License\n\nApache License, Version 2.0\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frootsongjc%2Frag-chatbot","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Frootsongjc%2Frag-chatbot","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frootsongjc%2Frag-chatbot/lists"}