https://github.com/gregpriday/searchsocket
A simple site search for Sveltekit websites.
https://github.com/gregpriday/searchsocket
Last synced: about 2 months ago
JSON representation
A simple site search for Sveltekit websites.
- Host: GitHub
- URL: https://github.com/gregpriday/searchsocket
- Owner: gregpriday
- Created: 2026-02-23T05:12:12.000Z (3 months ago)
- Default Branch: main
- Last Pushed: 2026-03-02T05:40:53.000Z (3 months ago)
- Last Synced: 2026-03-31T06:04:28.778Z (2 months ago)
- Language: TypeScript
- Size: 680 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 20
-
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
Awesome Lists containing this project
README
# SearchSocket
Semantic site search and MCP retrieval for SvelteKit content projects.
**Requirements**: Node.js >= 20
## Features
- **Embeddings**: Jina AI `jina-embeddings-v5-text-small` with task-specific LoRA adapters (configurable)
- **Vector Backend**: Turso/libSQL with vector search (local file DB for development, remote for production)
- **Rerank**: Jina `jina-reranker-v3` enabled by default — same API key
- **Page Aggregation**: Group results by page with score-weighted chunk decay
- **Meta Extraction**: Automatically extracts `` and `` for improved relevance
- **SvelteKit Integrations**:
- `searchsocketHandle()` for `POST /api/search` endpoint
- `searchsocketVitePlugin()` for build-triggered indexing
- **Client Library**: `createSearchClient()` for browser-side search, `buildResultUrl()` for scroll-to-section links
- **Scroll-to-Text**: `searchsocketScrollToText()` auto-scrolls to matching sections on navigation
- **MCP Server**: Model Context Protocol tools for search and page retrieval
## Install
```bash
# pnpm
pnpm add -D searchsocket
# npm
npm install -D searchsocket
```
SearchSocket is typically a dev dependency for CLI indexing. If you use `searchsocketHandle()` at runtime (e.g., in a Node server adapter), add it as a regular dependency instead.
## Quickstart
### 1. Initialize
```bash
pnpm searchsocket init
```
This creates:
- `searchsocket.config.ts` — minimal config file
- `.searchsocket/` — state directory (added to `.gitignore`)
### 2. Configure
Minimal config (`searchsocket.config.ts`):
```ts
export default {
embeddings: { apiKeyEnv: "JINA_API_KEY" }
};
```
**That's it!** Turso defaults work out of the box:
- **Development**: Uses local file DB at `.searchsocket/vectors.db`
- **Production**: Set `TURSO_DATABASE_URL` and `TURSO_AUTH_TOKEN` to use remote Turso
### 3. Add SvelteKit API Hook
Create or update `src/hooks.server.ts`:
```ts
import { searchsocketHandle } from "searchsocket/sveltekit";
export const handle = searchsocketHandle();
```
This exposes `POST /api/search` with automatic scope resolution.
### 4. Set Environment Variables
The CLI automatically loads `.env` from the working directory on startup, so your existing `.env` file works out of the box — no wrapper scripts or shell exports needed.
Development (`.env`):
```bash
JINA_API_KEY=jina_...
```
Production (add these for remote Turso):
```bash
JINA_API_KEY=jina_...
TURSO_DATABASE_URL=libsql://your-db.turso.io
TURSO_AUTH_TOKEN=eyJ...
```
### 5. Index Your Content
```bash
pnpm searchsocket index --changed-only
```
SearchSocket auto-detects the source mode based on your config:
- **`static-output`** (default): Reads prerendered HTML from `build/`
- **`build`**: Discovers routes from SvelteKit build manifest and renders via preview server
- **`crawl`**: Fetches pages from a running HTTP server
- **`content-files`**: Reads markdown/svelte source files directly
The indexing pipeline:
- Extracts content from `` (configurable), including `` description and keywords
- Chunks text with semantic heading boundaries
- Prepends page title to each chunk for embedding context
- Generates a synthetic summary chunk per page for identity matching
- Generates embeddings via Jina AI (with task-specific LoRA adapters for indexing vs search)
- Stores vectors in Turso/libSQL with cosine similarity index
### 6. Query
**Via API:**
```bash
curl -X POST http://localhost:5173/api/search \
-H "content-type: application/json" \
-d '{"q":"getting started","topK":5,"groupBy":"page"}'
```
**Via client library:**
```ts
import { createSearchClient } from "searchsocket/client";
const client = createSearchClient(); // defaults to /api/search
const response = await client.search({
q: "getting started",
topK: 5,
groupBy: "page",
pathPrefix: "/docs"
});
```
**Via CLI:**
```bash
pnpm searchsocket search --q "getting started" --top-k 5 --path-prefix /docs
```
**Response** (with `groupBy: "page"`, the default):
```json
{
"q": "getting started",
"scope": "main",
"results": [
{
"url": "/docs/intro",
"title": "Getting Started",
"sectionTitle": "Installation",
"snippet": "Install SearchSocket with pnpm add searchsocket...",
"score": 0.89,
"routeFile": "src/routes/docs/intro/+page.svelte",
"chunks": [
{
"sectionTitle": "Installation",
"snippet": "Install SearchSocket with pnpm add searchsocket...",
"headingPath": ["Getting Started", "Installation"],
"score": 0.89
},
{
"sectionTitle": "Configuration",
"snippet": "Create searchsocket.config.ts with your API key...",
"headingPath": ["Getting Started", "Configuration"],
"score": 0.74
}
]
}
],
"meta": {
"timingsMs": { "embed": 120, "vector": 15, "rerank": 0, "total": 135 },
"usedRerank": false,
"modelId": "jina-embeddings-v5-text-small"
}
}
```
The `chunks` array appears when a page has multiple matching chunks above the `minChunkScoreRatio` threshold. Use `groupBy: "chunk"` for flat per-chunk results without page aggregation.
## Source Modes
SearchSocket supports four source modes for loading pages to index.
### `static-output` (default)
Reads prerendered HTML files from SvelteKit's build output directory.
```ts
export default {
source: {
mode: "static-output",
staticOutputDir: "build"
}
};
```
Best for: Sites with fully prerendered pages. Run `vite build` first, then index.
### `build`
Discovers routes automatically from SvelteKit's build manifest and renders them via an ephemeral `vite preview` server. No manual route configuration needed.
```ts
export default {
source: {
build: {
outputDir: ".svelte-kit/output", // default
previewTimeout: 30000, // ms to wait for server (default)
exclude: ["/api/*", "/admin/*"], // glob patterns to skip
paramValues: { // values for dynamic routes
"/blog/[slug]": ["hello-world", "getting-started"],
"/docs/[category]/[page]": ["guides/quickstart", "api/search"]
},
discover: true, // crawl internal links to find pages (default: false)
seedUrls: ["/"], // starting URLs for discovery
maxPages: 200, // max pages to discover (default: 200)
maxDepth: 5 // max link depth from seed URLs (default: 5)
}
}
};
```
Best for: CI/CD pipelines. Enables `vite build && searchsocket index` with zero route configuration.
**How it works**:
1. Parses `.svelte-kit/output/server/manifest-full.js` to discover all page routes
2. Expands dynamic routes using `paramValues` (skips dynamic routes without values)
3. Starts an ephemeral `vite preview` server on a random port
4. Fetches all routes concurrently for SSR-rendered HTML
5. Provides exact route-to-file mapping (no heuristic matching needed)
6. Shuts down the preview server
**Dynamic routes**: Each key in `paramValues` maps to a route ID (e.g., `/blog/[slug]`) or its URL equivalent. Each value in the array replaces all `[param]` segments in the URL. Routes with layout groups like `/(app)/blog/[slug]` also match the URL key `/blog/[slug]`.
**Link discovery**: Enable `discover: true` to automatically find pages by crawling internal links from `seedUrls`. This is useful when dynamic routes have many parameter values that are impractical to enumerate. The crawler respects `maxPages` and `maxDepth` limits and only follows links within the same origin.
### `crawl`
Fetches pages from a running HTTP server.
```ts
export default {
source: {
crawl: {
baseUrl: "http://localhost:4173",
routes: ["/", "/docs", "/blog"], // explicit routes
sitemapUrl: "https://example.com/sitemap.xml" // or discover via sitemap
}
}
};
```
If `routes` is omitted and no `sitemapUrl` is set, defaults to crawling `["/"]` only.
### `content-files`
Reads markdown and svelte source files directly, without building or serving.
```ts
export default {
source: {
contentFiles: {
globs: ["src/routes/**/*.md", "content/**/*.md"],
baseDir: "."
}
}
};
```
## Client Library
SearchSocket exports a lightweight client for browser-side search:
```ts
import { createSearchClient } from "searchsocket/client";
const client = createSearchClient({
endpoint: "/api/search", // default
fetchImpl: fetch // default; override for SSR or testing
});
const response = await client.search({
q: "deployment guide",
topK: 8,
groupBy: "page",
pathPrefix: "/docs",
tags: ["guide"],
rerank: true
});
for (const result of response.results) {
console.log(result.url, result.title, result.score);
if (result.chunks) {
for (const chunk of result.chunks) {
console.log(" ", chunk.sectionTitle, chunk.score);
}
}
}
```
## Scroll-to-Text Navigation
When a visitor clicks a search result, SearchSocket can automatically scroll them to the relevant section on the destination page. This uses two utilities:
### `buildResultUrl(result)`
Builds a URL from a search result that includes:
- A `_ssk` query parameter for SvelteKit client-side navigation (read by `searchsocketScrollToText`)
- A [Text Fragment](https://developer.mozilla.org/en-US/docs/Web/URI/Fragment/Text_fragments) (`#:~:text=`) for native browser scroll-to-text on full page loads (Chrome 80+, Safari 16.1+, Firefox 131+)
Import from `searchsocket/client`:
```ts
import { createSearchClient, buildResultUrl } from "searchsocket/client";
const client = createSearchClient();
const { results } = await client.search({ q: "installation" });
// Use in your search UI
for (const result of results) {
const href = buildResultUrl(result);
// "/docs/getting-started?_ssk=Installation#:~:text=Installation"
}
```
If the result has no `sectionTitle`, the original URL is returned unchanged.
### `searchsocketScrollToText`
A SvelteKit `afterNavigate` hook that reads the `_ssk` parameter and scrolls the matching heading into view. Add it to your root layout:
```svelte
import { afterNavigate } from '$app/navigation';
import { searchsocketScrollToText } from 'searchsocket/sveltekit';
afterNavigate(searchsocketScrollToText);
```
The hook:
- Matches headings (h1–h6) case-insensitively with whitespace normalization
- Falls back to a broader text node search if no heading matches
- Scrolls smoothly to the first match
- Is a silent no-op when `_ssk` is absent or no match is found
## Vector Backend: Turso/libSQL
SearchSocket uses **Turso** (libSQL) as its single vector backend, providing a unified experience across development and production.
### Local Development
By default, SearchSocket uses a **local file database**:
- Path: `.searchsocket/vectors.db` (configurable)
- No account or API keys needed
- Full vector search with `libsql_vector_idx` and `vector_top_k`
- Perfect for local development and CI testing
### Production (Remote Turso)
For production, switch to **Turso's hosted service**:
1. **Sign up for Turso** (free tier available):
```bash
# Install Turso CLI
brew install tursodatabase/tap/turso
# Sign up
turso auth signup
# Create a database
turso db create searchsocket-prod
# Get credentials
turso db show searchsocket-prod --url
turso db tokens create searchsocket-prod
```
2. **Set environment variables**:
```bash
TURSO_DATABASE_URL=libsql://searchsocket-prod-xxx.turso.io
TURSO_AUTH_TOKEN=eyJhbGc...
```
3. **Index normally** — SearchSocket auto-detects the remote URL and uses it.
### Direct Credential Passing
Instead of environment variables, you can pass credentials directly in the config. This is useful for serverless deployments or multi-tenant setups:
```ts
export default {
embeddings: {
apiKey: "jina_..." // direct API key (takes precedence over apiKeyEnv)
},
vector: {
turso: {
url: "libsql://my-db.turso.io", // direct URL
authToken: "eyJhbGc..." // direct auth token
}
}
};
```
Direct values take precedence over environment variable lookups (`apiKeyEnv`, `urlEnv`, `authTokenEnv`).
### Dimension Mismatch Auto-Recovery
When switching embedding models (e.g., from a 1536-dim model to Jina's 1024-dim), the vector dimension changes. SearchSocket automatically detects this and recreates the chunks table with the new dimension — no manual intervention needed. A full re-index (`--force`) is still required after switching models.
### Why Turso?
- **Single backend** — one unified Turso/libSQL store for vectors, metadata, and state
- **Local-first development** — zero external dependencies for local dev
- **Production-ready** — same codebase scales to remote hosted DB
- **Cost-effective** — Turso free tier includes 9GB storage, 500M row reads/month
- **Vector search native** — `F32_BLOB` vectors, cosine similarity index, `vector_top_k` ANN queries
## Serverless Deployment (Vercel, Netlify, etc.)
SearchSocket works on serverless platforms with a few adjustments:
### Requirements
1. **Remote Turso database** — local SQLite is not available in serverless (no persistent filesystem). Set `TURSO_DATABASE_URL` and `TURSO_AUTH_TOKEN` as platform environment variables.
2. **Inline config via `rawConfig`** — the default config loader uses `jiti` to import `searchsocket.config.ts` from disk, which isn't bundled in serverless. Use `rawConfig` to pass config inline:
```ts
// hooks.server.ts (Vercel / Netlify)
import { searchsocketHandle } from "searchsocket/sveltekit";
export const handle = searchsocketHandle({
rawConfig: {
project: { id: "my-docs-site" },
source: { mode: "static-output" },
embeddings: { apiKeyEnv: "JINA_API_KEY" },
}
});
```
3. **Environment variables** — set these on your platform dashboard:
- `JINA_API_KEY`
- `TURSO_DATABASE_URL`
- `TURSO_AUTH_TOKEN`
### Rate Limiting
The built-in `InMemoryRateLimiter` auto-disables on serverless platforms (it resets on every cold start). Use your platform's WAF or edge rate-limiting instead.
### What Only Applies to Indexing
The following features are only used during `searchsocket index` (CLI), not the search handler:
- `ensureStateDirs` — creates `.searchsocket/` state directories
- Local SQLite fallback — only needed when `TURSO_DATABASE_URL` is not set
### Adapter Guidance
| Platform | Adapter | Notes |
|----------|---------|-------|
| Vercel | `adapter-auto` (default) | Serverless — use `rawConfig` + remote Turso |
| Netlify | `adapter-netlify` | Serverless — same as Vercel |
| VPS / Docker | `adapter-node` | Long-lived process — no limitations, local SQLite works |
## Embeddings: Jina AI
SearchSocket uses **Jina AI's embedding models** to convert text into semantic vectors. A single `JINA_API_KEY` powers both embeddings and optional reranking.
### Default Model
- **Model**: `jina-embeddings-v5-text-small`
- **Dimensions**: 1024 (default)
- **Cost**: ~$0.00005 per 1K tokens
- **Task adapters**: Uses `retrieval.passage` for indexing, `retrieval.query` for search queries (LoRA task-specific adapters for better retrieval quality)
### How It Works
1. **Chunking**: Text is split into semantic chunks (default 2200 chars, 200 overlap)
2. **Title Prepend**: Page title is prepended to each chunk for better context (`chunking.prependTitle`, default: true)
3. **Summary Chunk**: A synthetic identity chunk is generated per page with title, URL, and first paragraph (`chunking.pageSummaryChunk`, default: true)
4. **Embedding**: Each chunk is sent to Jina's embedding API with the `retrieval.passage` task adapter
5. **Batching**: Requests batched (64 texts per request) for efficiency
6. **Storage**: Vectors stored in Turso with metadata (URL, title, tags, depth, etc.)
### Cost Estimation
Use `--dry-run` to preview costs:
```bash
pnpm searchsocket index --dry-run
```
Output:
```
pages processed: 42
chunks total: 156
chunks changed: 156
embeddings created: 156
estimated tokens: 32,400
estimated cost (USD): $0.000648
```
### Reranking
Since embeddings and reranking share the same Jina API key, enabling reranking is one boolean:
```ts
export default {
embeddings: { apiKeyEnv: "JINA_API_KEY" },
rerank: { enabled: true }
};
```
**Note**: Changing the model after indexing requires re-indexing with `--force`.
## Search & Ranking
### Page Aggregation
By default (`groupBy: "page"`), SearchSocket groups chunk results by page URL and computes a page-level score:
1. The top chunk score becomes the base page score
2. Additional matching chunks contribute a decaying bonus: `chunk_score * decay^i`
3. Optional per-URL page weights are applied multiplicatively
Configure aggregation behavior:
```ts
export default {
ranking: {
minScore: 0, // minimum absolute score to include in results (default: 0, disabled)
aggregationCap: 5, // max chunks contributing to page score (default: 5)
aggregationDecay: 0.5, // decay factor for additional chunks (default: 0.5)
minChunkScoreRatio: 0.5, // threshold for sub-chunks in results (default: 0.5)
pageWeights: { // per-URL score multipliers
"/": 1.1,
"/docs": 1.15,
"/download": 1.2
},
weights: {
aggregation: 0.1, // weight of aggregation bonus (default: 0.1)
incomingLinks: 0.05, // incoming link boost weight (default: 0.05)
depth: 0.03, // URL depth boost weight (default: 0.03)
rerank: 1.0 // reranker score weight (default: 1.0)
}
}
};
```
`pageWeights` supports exact URL matches and prefix matching. A weight of `1.15` on `"/docs"` boosts all pages under `/docs/` by 15%. Use gentle values (1.05-1.2x) since they compound with aggregation.
`minScore` filters out low-relevance results before they reach the client. Set to a value like `0.3` to remove noise. In page mode, pages below the threshold are dropped; in chunk mode, individual chunks are filtered. Default is `0` (disabled).
### Chunk Mode
Use `groupBy: "chunk"` for flat per-chunk results without page aggregation:
```bash
curl -X POST http://localhost:5173/api/search \
-H "content-type: application/json" \
-d '{"q":"vector search","topK":10,"groupBy":"chunk"}'
```
## Build-Triggered Indexing
Automatically index after each SvelteKit build.
**`vite.config.ts` or `svelte.config.js`:**
```ts
import { searchsocketVitePlugin } from "searchsocket/sveltekit";
export default {
plugins: [
svelteKitPlugin(),
searchsocketVitePlugin({
enabled: true, // or check process.env.SEARCHSOCKET_AUTO_INDEX
changedOnly: true, // incremental indexing (faster)
verbose: false
})
]
};
```
**Environment control:**
```bash
# Enable via env var
SEARCHSOCKET_AUTO_INDEX=1 pnpm build
# Disable via env var
SEARCHSOCKET_DISABLE_AUTO_INDEX=1 pnpm build
```
## Commands
### `searchsocket init`
Initialize config and state directory.
```bash
pnpm searchsocket init
```
### `searchsocket index`
Index content into vectors.
```bash
# Incremental (only changed chunks)
pnpm searchsocket index --changed-only
# Full re-index
pnpm searchsocket index --force
# Preview cost without indexing
pnpm searchsocket index --dry-run
# Override source mode
pnpm searchsocket index --source build
# Limit for testing
pnpm searchsocket index --max-pages 10 --max-chunks 50
# Override scope
pnpm searchsocket index --scope staging
# Verbose output
pnpm searchsocket index --verbose
```
### `searchsocket status`
Show indexing status, scope, and vector health.
```bash
pnpm searchsocket status
# Output:
# project: my-site
# resolved scope: main
# embedding model: jina-embeddings-v5-text-small
# vector backend: turso/libsql (local (.searchsocket/vectors.db))
# vector health: ok
# last indexed (main): 2025-02-23T10:30:00Z
# tracked chunks: 156
# last estimated tokens: 32,400
# last estimated cost: $0.000648
```
### `searchsocket dev`
Watch for file changes and auto-reindex.
```bash
pnpm searchsocket dev
# With MCP server
pnpm searchsocket dev --mcp --mcp-port 3338
```
Watches:
- `src/routes/**` (route files)
- `build/` (if static-output mode)
- Build output dir (if build mode)
- Content files (if content-files mode)
- `searchsocket.config.ts` (if crawl or build mode)
### `searchsocket clean`
Delete local state and optionally remote vectors.
```bash
# Local state only
pnpm searchsocket clean
# Local + remote vectors
pnpm searchsocket clean --remote --scope staging
```
### `searchsocket prune`
Delete stale scopes (e.g., deleted git branches).
```bash
# Dry run (shows what would be deleted)
pnpm searchsocket prune --older-than 30d
# Apply deletions
pnpm searchsocket prune --older-than 30d --apply
# Use custom scope list
pnpm searchsocket prune --scopes-file active-branches.txt --apply
```
### `searchsocket doctor`
Validate config, env vars, and connectivity.
```bash
pnpm searchsocket doctor
# Output:
# PASS config parse
# PASS env JINA_API_KEY
# PASS turso/libsql (local file: .searchsocket/vectors.db)
# PASS source: build manifest
# PASS source: vite binary
# PASS embedding provider connectivity
# PASS vector backend connectivity
# PASS vector backend write permission
# PASS state directory writable
```
### `searchsocket mcp`
Run MCP server for Claude Desktop / other MCP clients.
```bash
# stdio transport (default)
pnpm searchsocket mcp
# HTTP transport
pnpm searchsocket mcp --transport http --port 3338
```
### `searchsocket search`
CLI search for testing.
```bash
pnpm searchsocket search --q "turso vector search" --top-k 5 --rerank
```
## MCP (Model Context Protocol)
SearchSocket provides an **MCP server** for integration with Claude Code, Claude Desktop, and other MCP-compatible AI tools. This gives AI assistants direct access to your indexed site content for semantic search and page retrieval.
### Tools
**`search(query, opts?)`**
- Semantic search across indexed content
- Returns ranked results with URL, title, snippet, score, and routeFile
- Options: `scope`, `topK` (1-100), `pathPrefix`, `tags`, `groupBy` (`"page"` | `"chunk"`)
**`get_page(pathOrUrl, opts?)`**
- Retrieve full indexed page content as markdown with frontmatter
- Options: `scope`
### Setup (Claude Code)
Add a `.mcp.json` file to your project root (safe to commit — no secrets needed since the CLI auto-loads `.env`):
```json
{
"mcpServers": {
"searchsocket": {
"type": "stdio",
"command": "npx",
"args": ["searchsocket", "mcp"],
"env": {}
}
}
}
```
Restart Claude Code. The `search` and `get_page` tools will be available automatically. Verify with:
```bash
claude mcp list
```
### Setup (Claude Desktop)
Add to `~/Library/Application Support/Claude/claude_desktop_config.json`:
```json
{
"mcpServers": {
"searchsocket": {
"command": "npx",
"args": ["searchsocket", "mcp"],
"cwd": "/path/to/your/project"
}
}
}
```
Restart Claude Desktop. The tools appear in the MCP menu.
### HTTP Transport
For non-stdio clients, run the MCP server over HTTP:
```bash
npx searchsocket mcp --transport http --port 3338
```
This starts a stateless server at `http://127.0.0.1:3338/mcp`. Each POST request creates a fresh server instance with no session persistence.
## Environment Variables
The CLI automatically loads `.env` from the working directory on startup. Existing `process.env` values take precedence over `.env` file values. This only applies to CLI commands (`searchsocket index`, `searchsocket mcp`, etc.) — library imports like `searchsocketHandle()` rely on your framework's own `.env` handling (Vite/SvelteKit).
### Required
**Jina AI:**
- `JINA_API_KEY` — Jina AI API key for embeddings and reranking
### Optional (Turso)
**Remote Turso (production):**
- `TURSO_DATABASE_URL` — Turso database URL (e.g., `libsql://my-db.turso.io`)
- `TURSO_AUTH_TOKEN` — Turso auth token
If not set, uses local file DB at `.searchsocket/vectors.db`.
### Optional (Scope/Build)
- `SEARCHSOCKET_SCOPE` — Override scope (when `scope.mode: "env"`)
- `SEARCHSOCKET_AUTO_INDEX` — Enable build-triggered indexing
- `SEARCHSOCKET_DISABLE_AUTO_INDEX` — Disable build-triggered indexing
## Configuration
### Full Example
```ts
export default {
project: {
id: "my-site",
baseUrl: "https://example.com"
},
scope: {
mode: "git", // "fixed" | "git" | "env"
fixed: "main",
sanitize: true
},
source: {
mode: "build", // "static-output" | "crawl" | "content-files" | "build"
staticOutputDir: "build",
strictRouteMapping: false,
// Build mode (recommended for CI/CD)
build: {
outputDir: ".svelte-kit/output",
previewTimeout: 30000,
exclude: ["/api/*"],
paramValues: {
"/blog/[slug]": ["hello-world", "getting-started"]
},
discover: false,
seedUrls: ["/"],
maxPages: 200,
maxDepth: 5
},
// Crawl mode (alternative)
crawl: {
baseUrl: "http://localhost:4173",
routes: ["/", "/docs", "/blog"],
sitemapUrl: "https://example.com/sitemap.xml"
},
// Content files mode (alternative)
contentFiles: {
globs: ["src/routes/**/*.md"],
baseDir: "."
}
},
extract: {
mainSelector: "main",
dropTags: ["header", "nav", "footer", "aside"],
dropSelectors: [".sidebar", ".toc"],
ignoreAttr: "data-search-ignore",
noindexAttr: "data-search-noindex",
respectRobotsNoindex: true
},
chunking: {
maxChars: 2200,
overlapChars: 200,
minChars: 250,
headingPathDepth: 3,
dontSplitInside: ["code", "table", "blockquote"],
prependTitle: true, // prepend page title to chunk text before embedding
pageSummaryChunk: true // generate synthetic identity chunk per page
},
embeddings: {
provider: "jina",
model: "jina-embeddings-v5-text-small",
apiKey: "jina_...", // direct API key (or use apiKeyEnv)
apiKeyEnv: "JINA_API_KEY",
batchSize: 64,
concurrency: 4
},
vector: {
dimension: 1024, // optional, inferred from first embedding
turso: {
url: "libsql://my-db.turso.io", // direct URL (or use urlEnv)
authToken: "eyJhbGc...", // direct token (or use authTokenEnv)
urlEnv: "TURSO_DATABASE_URL",
authTokenEnv: "TURSO_AUTH_TOKEN",
localPath: ".searchsocket/vectors.db"
}
},
rerank: {
enabled: true,
topN: 20,
model: "jina-reranker-v3"
},
ranking: {
enableIncomingLinkBoost: true,
enableDepthBoost: true,
pageWeights: {
"/": 1.1,
"/docs": 1.15
},
minScore: 0,
aggregationCap: 5,
aggregationDecay: 0.5,
minChunkScoreRatio: 0.5,
weights: {
incomingLinks: 0.05,
depth: 0.03,
rerank: 1.0,
aggregation: 0.1
}
},
api: {
path: "/api/search",
cors: {
allowOrigins: ["https://example.com"]
},
rateLimit: {
windowMs: 60_000,
max: 60
}
}
};
```
## License
MIT