https://github.com/gregpriday/searchsocket

A simple site search for SvelteKit websites.

Host: GitHub
URL: https://github.com/gregpriday/searchsocket
Owner: gregpriday
Created: 2026-02-23T05:12:12.000Z (3 months ago)
Default Branch: main
Last Pushed: 2026-03-02T05:40:53.000Z (3 months ago)
Last Synced: 2026-03-31T06:04:28.778Z (2 months ago)
Language: TypeScript
Size: 680 KB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 20
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md

# SearchSocket

Semantic site search and MCP retrieval for SvelteKit content projects.

**Requirements**: Node.js >= 20

## Features

- **Embeddings**: Jina AI `jina-embeddings-v5-text-small` with task-specific LoRA adapters (configurable)
- **Vector Backend**: Turso/libSQL with vector search (local file DB for development, remote for production)
- **Rerank**: Jina `jina-reranker-v3` enabled by default — same API key
- **Page Aggregation**: Group results by page with score-weighted chunk decay
- **Meta Extraction**: Automatically extracts the `<meta>` description and keywords for improved relevance
- **SvelteKit Integrations**:
  - `searchsocketHandle()` for `POST /api/search` endpoint
  - `searchsocketVitePlugin()` for build-triggered indexing
- **Client Library**: `createSearchClient()` for browser-side search, `buildResultUrl()` for scroll-to-section links
- **Scroll-to-Text**: `searchsocketScrollToText()` auto-scrolls to matching sections on navigation
- **MCP Server**: Model Context Protocol tools for search and page retrieval

## Install

```bash
# pnpm
pnpm add -D searchsocket

# npm
npm install -D searchsocket
```

SearchSocket is typically a dev dependency for CLI indexing. If you use `searchsocketHandle()` at runtime (e.g., in a Node server adapter), add it as a regular dependency instead.

## Quickstart

### 1. Initialize

```bash
pnpm searchsocket init
```

This creates:
- `searchsocket.config.ts` — minimal config file
- `.searchsocket/` — state directory (added to `.gitignore`)

### 2. Configure

Minimal config (`searchsocket.config.ts`):

```ts
export default {
  embeddings: { apiKeyEnv: "JINA_API_KEY" }
};
```

**That's it!** Turso defaults work out of the box:
- **Development**: Uses local file DB at `.searchsocket/vectors.db`
- **Production**: Set `TURSO_DATABASE_URL` and `TURSO_AUTH_TOKEN` to use remote Turso

### 3. Add SvelteKit API Hook

Create or update `src/hooks.server.ts`:

```ts
import { searchsocketHandle } from "searchsocket/sveltekit";

export const handle = searchsocketHandle();
```

This exposes `POST /api/search` with automatic scope resolution.

### 4. Set Environment Variables

The CLI automatically loads `.env` from the working directory on startup, so your existing `.env` file works out of the box — no wrapper scripts or shell exports needed.

Development (`.env`):
```bash
JINA_API_KEY=jina_...
```

Production (add these for remote Turso):
```bash
JINA_API_KEY=jina_...
TURSO_DATABASE_URL=libsql://your-db.turso.io
TURSO_AUTH_TOKEN=eyJ...
```

### 5. Index Your Content

```bash
pnpm searchsocket index --changed-only
```

SearchSocket auto-detects the source mode based on your config:
- **`static-output`** (default): Reads prerendered HTML from `build/`
- **`build`**: Discovers routes from SvelteKit build manifest and renders via preview server
- **`crawl`**: Fetches pages from a running HTTP server
- **`content-files`**: Reads markdown/svelte source files directly

The indexing pipeline:
- Extracts content from `<main>` (configurable), including the `<meta>` description and keywords
- Chunks text with semantic heading boundaries
- Prepends page title to each chunk for embedding context
- Generates a synthetic summary chunk per page for identity matching
- Generates embeddings via Jina AI (with task-specific LoRA adapters for indexing vs search)
- Stores vectors in Turso/libSQL with cosine similarity index

### 6. Query

**Via API:**
```bash
curl -X POST http://localhost:5173/api/search \
  -H "content-type: application/json" \
  -d '{"q":"getting started","topK":5,"groupBy":"page"}'
```

**Via client library:**
```ts
import { createSearchClient } from "searchsocket/client";

const client = createSearchClient(); // defaults to /api/search
const response = await client.search({
  q: "getting started",
  topK: 5,
  groupBy: "page",
  pathPrefix: "/docs"
});
```

**Via CLI:**
```bash
pnpm searchsocket search --q "getting started" --top-k 5 --path-prefix /docs
```

**Response** (with `groupBy: "page"`, the default):
```json
{
  "q": "getting started",
  "scope": "main",
  "results": [
    {
      "url": "/docs/intro",
      "title": "Getting Started",
      "sectionTitle": "Installation",
      "snippet": "Install SearchSocket with pnpm add searchsocket...",
      "score": 0.89,
      "routeFile": "src/routes/docs/intro/+page.svelte",
      "chunks": [
        {
          "sectionTitle": "Installation",
          "snippet": "Install SearchSocket with pnpm add searchsocket...",
          "headingPath": ["Getting Started", "Installation"],
          "score": 0.89
        },
        {
          "sectionTitle": "Configuration",
          "snippet": "Create searchsocket.config.ts with your API key...",
          "headingPath": ["Getting Started", "Configuration"],
          "score": 0.74
        }
      ]
    }
  ],
  "meta": {
    "timingsMs": { "embed": 120, "vector": 15, "rerank": 0, "total": 135 },
    "usedRerank": false,
    "modelId": "jina-embeddings-v5-text-small"
  }
}
```

The `chunks` array appears when a page has multiple matching chunks above the `minChunkScoreRatio` threshold. Use `groupBy: "chunk"` for flat per-chunk results without page aggregation.
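
For typed UI code, the response shape above can be described with a few interfaces. This is a sketch inferred from the example JSON only: the names `SketchChunk`, `SketchResult`, `SketchSearchResponse`, and `formatResults` are hypothetical, and the types actually exported by the package may differ or carry more fields.

```typescript
// Types inferred from the example response (assumption: not the package's own types).
interface SketchChunk {
  sectionTitle?: string;
  snippet: string;
  headingPath: string[];
  score: number;
}

interface SketchResult {
  url: string;
  title: string;
  sectionTitle?: string;
  snippet: string;
  score: number;
  routeFile: string;
  chunks?: SketchChunk[]; // only present when several chunks of the page match
}

interface SketchSearchResponse {
  q: string;
  scope: string;
  results: SketchResult[];
  meta: {
    timingsMs: { embed: number; vector: number; rerank: number; total: number };
    usedRerank: boolean;
    modelId: string;
  };
}

// Hypothetical helper: one line per page, then one indented line per matching chunk.
function formatResults(response: SketchSearchResponse): string[] {
  const lines: string[] = [];
  for (const result of response.results) {
    lines.push(`${result.score.toFixed(2)} ${result.title} (${result.url})`);
    for (const chunk of result.chunks ?? []) {
      lines.push(`  ${chunk.score.toFixed(2)} ${chunk.headingPath.join(" > ")}`);
    }
  }
  return lines;
}
```

Treating `chunks` as optional keeps the same rendering code usable for pages with a single matching chunk.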

## Source Modes

SearchSocket supports four source modes for loading pages to index.

### `static-output` (default)

Reads prerendered HTML files from SvelteKit's build output directory.

```ts
export default {
  source: {
    mode: "static-output",
    staticOutputDir: "build"
  }
};
```

Best for: Sites with fully prerendered pages. Run `vite build` first, then index.

### `build`

Discovers routes automatically from SvelteKit's build manifest and renders them via an ephemeral `vite preview` server. No manual route configuration needed.

```ts
export default {
  source: {
    build: {
      outputDir: ".svelte-kit/output", // default
      previewTimeout: 30000, // ms to wait for server (default)
      exclude: ["/api/*", "/admin/*"], // glob patterns to skip
      paramValues: { // values for dynamic routes
        "/blog/[slug]": ["hello-world", "getting-started"],
        "/docs/[category]/[page]": ["guides/quickstart", "api/search"]
      },
      discover: true, // crawl internal links to find pages (default: false)
      seedUrls: ["/"], // starting URLs for discovery
      maxPages: 200, // max pages to discover (default: 200)
      maxDepth: 5 // max link depth from seed URLs (default: 5)
    }
  }
};
```

Best for: CI/CD pipelines. Enables `vite build && searchsocket index` with zero route configuration.

**How it works**:
1. Parses `.svelte-kit/output/server/manifest-full.js` to discover all page routes
2. Expands dynamic routes using `paramValues` (skips dynamic routes without values)
3. Starts an ephemeral `vite preview` server on a random port
4. Fetches all routes concurrently for SSR-rendered HTML
5. Provides exact route-to-file mapping (no heuristic matching needed)
6. Shuts down the preview server

**Dynamic routes**: Each key in `paramValues` maps to a route ID (e.g., `/blog/[slug]`) or its URL equivalent. Each value in the array replaces all `[param]` segments in the URL. Routes with layout groups like `/(app)/blog/[slug]` also match the URL key `/blog/[slug]`.
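
To make the expansion rule concrete, here is a minimal sketch of how a `paramValues` entry could turn into URLs. It assumes that a value such as `"guides/quickstart"` is split on `/` and that its parts fill the `[param]` segments from left to right; `stripLayoutGroups` and `expandRoute` are hypothetical names, not functions exported by SearchSocket.

```typescript
// Remove SvelteKit layout groups such as "(app)" from a route ID.
function stripLayoutGroups(routeId: string): string {
  const kept = routeId.split("/").filter((segment) => !/^\(.+\)$/.test(segment));
  const joined = kept.join("/");
  return joined.startsWith("/") ? joined : `/${joined}`;
}

// Assumption: the parts of the value fill the [param] segments in order.
function expandRoute(routeId: string, value: string): string {
  const parts = value.split("/");
  let next = 0;
  return stripLayoutGroups(routeId).replace(/\[[^\]]+\]/g, (segment) => {
    const part = parts[next];
    next += 1;
    return part ?? segment; // leave the segment untouched if no part remains
  });
}

// Expand every route in a paramValues map into concrete URLs.
function expandAll(paramValues: Record<string, string[]>): string[] {
  const urls: string[] = [];
  for (const [routeId, values] of Object.entries(paramValues)) {
    for (const value of values) {
      urls.push(expandRoute(routeId, value));
    }
  }
  return urls;
}
```

Under these assumptions, `expandRoute("/docs/[category]/[page]", "guides/quickstart")` yields `/docs/guides/quickstart`, and a route ID with a layout group resolves to the same URL as its group-free form.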

**Link discovery**: Enable `discover: true` to automatically find pages by crawling internal links from `seedUrls`. This is useful when dynamic routes have many parameter values that are impractical to enumerate. The crawler respects `maxPages` and `maxDepth` limits and only follows links within the same origin.
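
The limits described above can be pictured as a bounded breadth-first traversal. The following is only a sketch of that idea, not the crawler SearchSocket ships: `discoverPages` is a hypothetical name, and `getLinks` is a stand-in callback for "fetch the page and return its `href` values".

```typescript
// Breadth-first discovery over same-origin links, bounded by maxPages and maxDepth.
function discoverPages(
  seedUrls: string[],
  getLinks: (path: string) => string[], // hypothetical link extractor
  origin: string,
  maxPages: number,
  maxDepth: number
): string[] {
  const seen = new Set<string>();
  const queue: Array<{ path: string; depth: number }> = [];

  for (const seed of seedUrls) {
    if (seen.size >= maxPages) break;
    if (!seen.has(seed)) {
      seen.add(seed);
      queue.push({ path: seed, depth: 0 });
    }
  }

  while (queue.length > 0) {
    const current = queue.shift()!;
    if (current.depth >= maxDepth) continue; // keep the page, but do not follow its links

    for (const href of getLinks(current.path)) {
      const url = new URL(href, origin + current.path);
      if (url.origin !== origin) continue; // only follow same-origin links
      if (seen.has(url.pathname)) continue;
      if (seen.size >= maxPages) return Array.from(seen);
      seen.add(url.pathname);
      queue.push({ path: url.pathname, depth: current.depth + 1 });
    }
  }
  return Array.from(seen);
}
```

Breadth-first order means that when `maxPages` is reached, the pages kept are the ones closest to the seed URLs.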

### `crawl`

Fetches pages from a running HTTP server.

```ts
export default {
  source: {
    crawl: {
      baseUrl: "http://localhost:4173",
      routes: ["/", "/docs", "/blog"], // explicit routes
      sitemapUrl: "https://example.com/sitemap.xml" // or discover via sitemap
    }
  }
};
```

If `routes` is omitted and no `sitemapUrl` is set, defaults to crawling `["/"]` only.

### `content-files`

Reads markdown and svelte source files directly, without building or serving.

```ts
export default {
  source: {
    contentFiles: {
      globs: ["src/routes/**/*.md", "content/**/*.md"],
      baseDir: "."
    }
  }
};
```

## Client Library

SearchSocket exports a lightweight client for browser-side search:

```ts
import { createSearchClient } from "searchsocket/client";

const client = createSearchClient({
  endpoint: "/api/search", // default
  fetchImpl: fetch // default; override for SSR or testing
});

const response = await client.search({
  q: "deployment guide",
  topK: 8,
  groupBy: "page",
  pathPrefix: "/docs",
  tags: ["guide"],
  rerank: true
});

for (const result of response.results) {
  console.log(result.url, result.title, result.score);
  if (result.chunks) {
    for (const chunk of result.chunks) {
      console.log("  ", chunk.sectionTitle, chunk.score);
    }
  }
}
```

## Scroll-to-Text Navigation

When a visitor clicks a search result, SearchSocket can automatically scroll them to the relevant section on the destination page. This uses two utilities:

### `buildResultUrl(result)`

Builds a URL from a search result that includes:
- A `_ssk` query parameter for SvelteKit client-side navigation (read by `searchsocketScrollToText`)
- A [Text Fragment](https://developer.mozilla.org/en-US/docs/Web/URI/Fragment/Text_fragments) (`#:~:text=`) for native browser scroll-to-text on full page loads (Chrome 80+, Safari 16.1+, Firefox 131+)

Import from `searchsocket/client`:

```ts
import { createSearchClient, buildResultUrl } from "searchsocket/client";

const client = createSearchClient();
const { results } = await client.search({ q: "installation" });

// Use in your search UI
for (const result of results) {
  const href = buildResultUrl(result);
  // "/docs/getting-started?_ssk=Installation#:~:text=Installation"
}
```

If the result has no `sectionTitle`, the original URL is returned unchanged.
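
To show how the two parts of that URL fit together, here is a rough sketch of the URL shape. It is not the library's implementation: `sketchBuildResultUrl` is a hypothetical name, and the sketch assumes the incoming `url` carries no fragment of its own.

```typescript
// Sketch of the URL shape only; the real buildResultUrl may encode differently.
function sketchBuildResultUrl(result: { url: string; sectionTitle?: string }): string {
  if (!result.sectionTitle) return result.url; // nothing to scroll to

  const encoded = encodeURIComponent(result.sectionTitle);
  // In a text fragment, "-" is syntax, so it must be percent-encoded as well.
  const fragmentText = encoded.replace(/-/g, "%2D");
  const separator = result.url.includes("?") ? "&" : "?";

  return `${result.url}${separator}_ssk=${encoded}#:~:text=${fragmentText}`;
}
```

The query parameter covers client-side navigation, where the browser does not act on the text fragment, while the fragment covers full page loads in browsers that support it.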

### `searchsocketScrollToText`

A SvelteKit `afterNavigate` hook that reads the `_ssk` parameter and scrolls the matching heading into view. Add it to your root layout:

```svelte
<script>
  import { afterNavigate } from '$app/navigation';
  import { searchsocketScrollToText } from 'searchsocket/sveltekit';

  afterNavigate(searchsocketScrollToText);
</script>
```

The hook:
- Matches headings (h1–h6) case-insensitively with whitespace normalization
- Falls back to a broader text node search if no heading matches
- Scrolls smoothly to the first match
- Is a silent no-op when `_ssk` is absent or no match is found
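
The matching rule in the first bullet can be sketched without a DOM. This assumes the comparison is an equality check after normalization, which the text above does not confirm; `normalizeText` and `findHeadingIndex` are hypothetical names.

```typescript
// Collapse whitespace runs, trim, and lowercase so that
// "  Getting\n Started " compares equal to "getting started".
function normalizeText(text: string): string {
  return text.replace(/\s+/g, " ").trim().toLowerCase();
}

// Return the index of the first heading matching the target, or -1 (silent no-op).
function findHeadingIndex(headings: string[], target: string): number {
  const wanted = normalizeText(target);
  if (wanted === "") return -1;
  return headings.findIndex((heading) => normalizeText(heading) === wanted);
}
```

In a browser, the `headings` array would come from the text content of the page's `h1` to `h6` elements, and the element at the returned index would be scrolled into view.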

## Vector Backend: Turso/libSQL

SearchSocket uses **Turso** (libSQL) as its single vector backend, providing a unified experience across development and production.

### Local Development

By default, SearchSocket uses a **local file database**:
- Path: `.searchsocket/vectors.db` (configurable)
- No account or API keys needed
- Full vector search with `libsql_vector_idx` and `vector_top_k`
- Perfect for local development and CI testing

### Production (Remote Turso)

For production, switch to **Turso's hosted service**:

1. **Sign up for Turso** (free tier available):
   ```bash
   # Install Turso CLI
   brew install tursodatabase/tap/turso

   # Sign up
   turso auth signup

   # Create a database
   turso db create searchsocket-prod

   # Get credentials
   turso db show searchsocket-prod --url
   turso db tokens create searchsocket-prod
   ```

2. **Set environment variables**:
   ```bash
   TURSO_DATABASE_URL=libsql://searchsocket-prod-xxx.turso.io
   TURSO_AUTH_TOKEN=eyJhbGc...
   ```

3. **Index normally** — SearchSocket auto-detects the remote URL and uses it.

### Direct Credential Passing

Instead of environment variables, you can pass credentials directly in the config. This is useful for serverless deployments or multi-tenant setups:

```ts
export default {
  embeddings: {
    apiKey: "jina_..." // direct API key (takes precedence over apiKeyEnv)
  },
  vector: {
    turso: {
      url: "libsql://my-db.turso.io", // direct URL
      authToken: "eyJhbGc..." // direct auth token
    }
  }
};
```

Direct values take precedence over environment variable lookups (`apiKeyEnv`, `urlEnv`, `authTokenEnv`).

### Dimension Mismatch Auto-Recovery

When switching embedding models (e.g., from a 1536-dim model to Jina's 1024-dim), the vector dimension changes. SearchSocket automatically detects this and recreates the chunks table with the new dimension — no manual intervention needed. A full re-index (`--force`) is still required after switching models.

### Why Turso?

- **Single backend** — one unified Turso/libSQL store for vectors, metadata, and state
- **Local-first development** — zero external dependencies for local dev
- **Production-ready** — same codebase scales to remote hosted DB
- **Cost-effective** — Turso free tier includes 9GB storage, 500M row reads/month
- **Vector search native** — `F32_BLOB` vectors, cosine similarity index, `vector_top_k` ANN queries

## Serverless Deployment (Vercel, Netlify, etc.)

SearchSocket works on serverless platforms with a few adjustments:

### Requirements

1. **Remote Turso database** — local SQLite is not available in serverless (no persistent filesystem). Set `TURSO_DATABASE_URL` and `TURSO_AUTH_TOKEN` as platform environment variables.

2. **Inline config via `rawConfig`** — the default config loader uses `jiti` to import `searchsocket.config.ts` from disk, which isn't bundled in serverless. Use `rawConfig` to pass config inline:

   ```ts
   // hooks.server.ts (Vercel / Netlify)
   import { searchsocketHandle } from "searchsocket/sveltekit";

   export const handle = searchsocketHandle({
     rawConfig: {
       project: { id: "my-docs-site" },
       source: { mode: "static-output" },
       embeddings: { apiKeyEnv: "JINA_API_KEY" },
     }
   });
   ```

3. **Environment variables** — set these on your platform dashboard:
   - `JINA_API_KEY`
   - `TURSO_DATABASE_URL`
   - `TURSO_AUTH_TOKEN`

### Rate Limiting

The built-in `InMemoryRateLimiter` auto-disables on serverless platforms (it resets on every cold start). Use your platform's WAF or edge rate-limiting instead.
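
The reason an in-memory limiter is of little use on serverless platforms is easier to see in code. The class below is a generic fixed-window limiter written for illustration; it is not the built-in `InMemoryRateLimiter`, whose algorithm is not described here, and `SketchRateLimiter` is a hypothetical name.

```typescript
// Generic fixed-window limiter; all state lives in this process's memory.
class SketchRateLimiter {
  private hits = new Map<string, { windowStart: number; count: number }>();
  private windowMs: number;
  private max: number;

  constructor(windowMs: number, max: number) {
    this.windowMs = windowMs;
    this.max = max;
  }

  // Returns true if the request identified by `key` is allowed at time `now` (ms).
  allow(key: string, now: number = Date.now()): boolean {
    const entry = this.hits.get(key);
    if (!entry || now - entry.windowStart >= this.windowMs) {
      this.hits.set(key, { windowStart: now, count: 1 }); // start a new window
      return true;
    }
    entry.count += 1;
    return entry.count <= this.max;
  }
}
```

Because the counters live in a `Map` inside the process, every new serverless instance starts with an empty map, so a client can exceed the limit simply by landing on a fresh instance.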

### What Only Applies to Indexing

The following features are only used during `searchsocket index` (CLI), not the search handler:
- `ensureStateDirs` — creates `.searchsocket/` state directories
- Local SQLite fallback — only needed when `TURSO_DATABASE_URL` is not set

### Adapter Guidance

| Platform | Adapter | Notes |
|----------|---------|-------|
| Vercel | `adapter-auto` (default) | Serverless — use `rawConfig` + remote Turso |
| Netlify | `adapter-netlify` | Serverless — same as Vercel |
| VPS / Docker | `adapter-node` | Long-lived process — no limitations, local SQLite works |

## Embeddings: Jina AI

SearchSocket uses **Jina AI's embedding models** to convert text into semantic vectors. A single `JINA_API_KEY` powers both embeddings and optional reranking.

### Default Model

- **Model**: `jina-embeddings-v5-text-small`
- **Dimensions**: 1024 (default)
- **Cost**: ~$0.00005 per 1K tokens
- **Task adapters**: Uses `retrieval.passage` for indexing, `retrieval.query` for search queries (LoRA task-specific adapters for better retrieval quality)

### How It Works

1. **Chunking**: Text is split into semantic chunks (default 2200 chars, 200 overlap)
2. **Title Prepend**: Page title is prepended to each chunk for better context (`chunking.prependTitle`, default: true)
3. **Summary Chunk**: A synthetic identity chunk is generated per page with title, URL, and first paragraph (`chunking.pageSummaryChunk`, default: true)
4. **Embedding**: Each chunk is sent to Jina's embedding API with the `retrieval.passage` task adapter
5. **Batching**: Requests batched (64 texts per request) for efficiency
6. **Storage**: Vectors stored in Turso with metadata (URL, title, tags, depth, etc.)
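
The chunking, title-prepend, and batching steps can be approximated as follows. This sketch uses a plain fixed-size sliding window rather than the heading-aware semantic chunking described above, and the way the title is joined to the chunk text is an assumption; `chunkText`, `withTitle`, and `toBatches` are hypothetical names.

```typescript
// Fixed-size sliding window with overlap (a simplification of semantic chunking).
function chunkText(text: string, maxChars = 2200, overlapChars = 200): string[] {
  if (overlapChars >= maxChars) {
    throw new Error("overlapChars must be smaller than maxChars");
  }
  const chunks: string[] = [];
  const step = maxChars - overlapChars;
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + maxChars));
    if (start + maxChars >= text.length) break; // the last window reached the end
  }
  return chunks;
}

// Assumption: the title is placed on its own line above the chunk text.
function withTitle(title: string, chunk: string): string {
  return `${title}\n\n${chunk}`;
}

// Group texts into request-sized batches (64 per request by default).
function toBatches<T>(items: T[], batchSize = 64): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}
```

With the default batch size, 156 chunks need three embedding requests: two full batches of 64 and one of 28.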

### Cost Estimation

Use `--dry-run` to preview costs:
```bash
pnpm searchsocket index --dry-run
```

Output:
```
pages processed: 42
chunks total: 156
chunks changed: 156
embeddings created: 156
estimated tokens: 32,400
estimated cost (USD): $0.000648
```

### Reranking

Since embeddings and reranking share the same Jina API key, enabling reranking is one boolean:

```ts
export default {
  embeddings: { apiKeyEnv: "JINA_API_KEY" },
  rerank: { enabled: true }
};
```

**Note**: Changing the model after indexing requires re-indexing with `--force`.

## Search & Ranking

### Page Aggregation

By default (`groupBy: "page"`), SearchSocket groups chunk results by page URL and computes a page-level score:

1. The top chunk score becomes the base page score
2. Additional matching chunks contribute a decaying bonus: `chunk_score * decay^i`
3. Optional per-URL page weights are applied multiplicatively

Configure aggregation behavior:

```ts
export default {
  ranking: {
    minScore: 0, // minimum absolute score to include in results (default: 0, disabled)
    aggregationCap: 5, // max chunks contributing to page score (default: 5)
    aggregationDecay: 0.5, // decay factor for additional chunks (default: 0.5)
    minChunkScoreRatio: 0.5, // threshold for sub-chunks in results (default: 0.5)
    pageWeights: { // per-URL score multipliers
      "/": 1.1,
      "/docs": 1.15,
      "/download": 1.2
    },
    weights: {
      aggregation: 0.1, // weight of aggregation bonus (default: 0.1)
      incomingLinks: 0.05, // incoming link boost weight (default: 0.05)
      depth: 0.03, // URL depth boost weight (default: 0.03)
      rerank: 1.0 // reranker score weight (default: 1.0)
    }
  }
};
```

`pageWeights` supports exact URL matches and prefix matching. A weight of `1.15` on `"/docs"` boosts all pages under `/docs/` by 15%. Use gentle values (1.05-1.2x) since they compound with aggregation.

`minScore` filters out low-relevance results before they reach the client. Set to a value like `0.3` to remove noise. In page mode, pages below the threshold are dropped; in chunk mode, individual chunks are filtered. Default is `0` (disabled).
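
One way to read the three aggregation steps together with these options is sketched below. How the decayed bonus is combined with the base score is an assumption here (base plus `weights.aggregation` times the bonus), so treat this as an illustration of the idea rather than the library's formula; `pageScore` and `AggregationOptions` are hypothetical names.

```typescript
interface AggregationOptions {
  aggregationCap: number;    // ranking.aggregationCap
  aggregationDecay: number;  // ranking.aggregationDecay
  aggregationWeight: number; // ranking.weights.aggregation
}

// Sketch: top chunk is the base, later chunks add chunk_score * decay^i,
// and the page weight multiplies the result.
function pageScore(
  chunkScores: number[],
  options: AggregationOptions,
  pageWeight = 1
): number {
  const top = [...chunkScores]
    .sort((a, b) => b - a)
    .slice(0, options.aggregationCap);
  if (top.length === 0) return 0;

  let bonus = 0;
  for (let i = 1; i < top.length; i++) {
    bonus += top[i] * Math.pow(options.aggregationDecay, i);
  }
  // Assumption: the bonus is scaled by the aggregation weight before being added.
  return (top[0] + options.aggregationWeight * bonus) * pageWeight;
}
```

Whatever the exact combination, the shape is the same: the best chunk dominates, each further chunk counts for less, and `pageWeights` scales the final page score.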

### Chunk Mode

Use `groupBy: "chunk"` for flat per-chunk results without page aggregation:

```bash
curl -X POST http://localhost:5173/api/search \
  -H "content-type: application/json" \
  -d '{"q":"vector search","topK":10,"groupBy":"chunk"}'
```

## Build-Triggered Indexing

Automatically index after each SvelteKit build.

**`vite.config.ts`:**
```ts
import { sveltekit } from "@sveltejs/kit/vite";
import { defineConfig } from "vite";
import { searchsocketVitePlugin } from "searchsocket/sveltekit";

export default defineConfig({
  plugins: [
    sveltekit(), // SvelteKit's own Vite plugin
    searchsocketVitePlugin({
      enabled: true, // or check process.env.SEARCHSOCKET_AUTO_INDEX
      changedOnly: true, // incremental indexing (faster)
      verbose: false
    })
  ]
});
```

**Environment control:**
```bash
# Enable via env var
SEARCHSOCKET_AUTO_INDEX=1 pnpm build

# Disable via env var
SEARCHSOCKET_DISABLE_AUTO_INDEX=1 pnpm build
```

## Commands

### `searchsocket init`

Initialize config and state directory.

```bash
pnpm searchsocket init
```

### `searchsocket index`

Index content into vectors.

```bash
# Incremental (only changed chunks)
pnpm searchsocket index --changed-only

# Full re-index
pnpm searchsocket index --force

# Preview cost without indexing
pnpm searchsocket index --dry-run

# Override source mode
pnpm searchsocket index --source build

# Limit for testing
pnpm searchsocket index --max-pages 10 --max-chunks 50

# Override scope
pnpm searchsocket index --scope staging

# Verbose output
pnpm searchsocket index --verbose
```

### `searchsocket status`

Show indexing status, scope, and vector health.

```bash
pnpm searchsocket status

# Output:
# project: my-site
# resolved scope: main
# embedding model: jina-embeddings-v5-text-small
# vector backend: turso/libsql (local (.searchsocket/vectors.db))
# vector health: ok
# last indexed (main): 2025-02-23T10:30:00Z
# tracked chunks: 156
# last estimated tokens: 32,400
# last estimated cost: $0.000648
```

### `searchsocket dev`

Watch for file changes and auto-reindex.

```bash
pnpm searchsocket dev

# With MCP server
pnpm searchsocket dev --mcp --mcp-port 3338
```

Watches:
- `src/routes/**` (route files)
- `build/` (if static-output mode)
- Build output dir (if build mode)
- Content files (if content-files mode)
- `searchsocket.config.ts` (if crawl or build mode)

### `searchsocket clean`

Delete local state and optionally remote vectors.

```bash
# Local state only
pnpm searchsocket clean

# Local + remote vectors
pnpm searchsocket clean --remote --scope staging
```

### `searchsocket prune`

Delete stale scopes (e.g., deleted git branches).

```bash
# Dry run (shows what would be deleted)
pnpm searchsocket prune --older-than 30d

# Apply deletions
pnpm searchsocket prune --older-than 30d --apply

# Use custom scope list
pnpm searchsocket prune --scopes-file active-branches.txt --apply
```

### `searchsocket doctor`

Validate config, env vars, and connectivity.

```bash
pnpm searchsocket doctor

# Output:
# PASS config parse
# PASS env JINA_API_KEY
# PASS turso/libsql (local file: .searchsocket/vectors.db)
# PASS source: build manifest
# PASS source: vite binary
# PASS embedding provider connectivity
# PASS vector backend connectivity
# PASS vector backend write permission
# PASS state directory writable
```

### `searchsocket mcp`

Run MCP server for Claude Desktop / other MCP clients.

```bash
# stdio transport (default)
pnpm searchsocket mcp

# HTTP transport
pnpm searchsocket mcp --transport http --port 3338
```

### `searchsocket search`

CLI search for testing.

```bash
pnpm searchsocket search --q "turso vector search" --top-k 5 --rerank
```

## MCP (Model Context Protocol)

SearchSocket provides an **MCP server** for integration with Claude Code, Claude Desktop, and other MCP-compatible AI tools. This gives AI assistants direct access to your indexed site content for semantic search and page retrieval.

### Tools

**`search(query, opts?)`**
- Semantic search across indexed content
- Returns ranked results with URL, title, snippet, score, and routeFile
- Options: `scope`, `topK` (1-100), `pathPrefix`, `tags`, `groupBy` (`"page"` | `"chunk"`)

**`get_page(pathOrUrl, opts?)`**
- Retrieve full indexed page content as markdown with frontmatter
- Options: `scope`

### Setup (Claude Code)

Add a `.mcp.json` file to your project root (safe to commit — no secrets needed since the CLI auto-loads `.env`):

```json
{
  "mcpServers": {
    "searchsocket": {
      "type": "stdio",
      "command": "npx",
      "args": ["searchsocket", "mcp"],
      "env": {}
    }
  }
}
```

Restart Claude Code. The `search` and `get_page` tools will be available automatically. Verify with:

```bash
claude mcp list
```

### Setup (Claude Desktop)

Add to `~/Library/Application Support/Claude/claude_desktop_config.json`:

```json
{
  "mcpServers": {
    "searchsocket": {
      "command": "npx",
      "args": ["searchsocket", "mcp"],
      "cwd": "/path/to/your/project"
    }
  }
}
```

Restart Claude Desktop. The tools appear in the MCP menu.

### HTTP Transport

For non-stdio clients, run the MCP server over HTTP:

```bash
npx searchsocket mcp --transport http --port 3338
```

This starts a stateless server at `http://127.0.0.1:3338/mcp`. Each POST request creates a fresh server instance with no session persistence.

## Environment Variables

The CLI automatically loads `.env` from the working directory on startup. Existing `process.env` values take precedence over `.env` file values. This only applies to CLI commands (`searchsocket index`, `searchsocket mcp`, etc.) — library imports like `searchsocketHandle()` rely on your framework's own `.env` handling (Vite/SvelteKit).
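
The precedence rule can be illustrated with a small sketch: values parsed from `.env` only fill in variables that are not already set. This is not the CLI's actual loader; `parseDotenv` and `mergeEnv` are hypothetical names, and the parser handles only simple `KEY=value` lines.

```typescript
// Parse simple KEY=value lines, skipping blanks and "#" comments.
function parseDotenv(contents: string): Record<string, string> {
  const parsed: Record<string, string> = {};
  for (const line of contents.split(/\r?\n/)) {
    const trimmed = line.trim();
    if (trimmed === "" || trimmed.startsWith("#")) continue;
    const eq = trimmed.indexOf("=");
    if (eq <= 0) continue; // no key before "="
    const key = trimmed.slice(0, eq).trim();
    // Strip one pair of matching surrounding quotes, if present.
    const value = trimmed.slice(eq + 1).trim().replace(/^(['"])(.*)\1$/, "$2");
    parsed[key] = value;
  }
  return parsed;
}

// Existing environment values win; file values only fill the gaps.
function mergeEnv(
  existing: Record<string, string | undefined>,
  fileValues: Record<string, string>
): Record<string, string | undefined> {
  const merged: Record<string, string | undefined> = { ...existing };
  for (const [key, value] of Object.entries(fileValues)) {
    if (merged[key] === undefined) merged[key] = value;
  }
  return merged;
}
```

In practice this means a `JINA_API_KEY` exported in the shell or set by a CI system overrides the one in `.env`, which keeps local files from shadowing deployment secrets.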

### Required

**Jina AI:**
- `JINA_API_KEY` — Jina AI API key for embeddings and reranking

### Optional (Turso)

**Remote Turso (production):**
- `TURSO_DATABASE_URL` — Turso database URL (e.g., `libsql://my-db.turso.io`)
- `TURSO_AUTH_TOKEN` — Turso auth token

If not set, uses local file DB at `.searchsocket/vectors.db`.

### Optional (Scope/Build)

- `SEARCHSOCKET_SCOPE` — Override scope (when `scope.mode: "env"`)
- `SEARCHSOCKET_AUTO_INDEX` — Enable build-triggered indexing
- `SEARCHSOCKET_DISABLE_AUTO_INDEX` — Disable build-triggered indexing

## Configuration

### Full Example

```ts
export default {
  project: {
    id: "my-site",
    baseUrl: "https://example.com"
  },

  scope: {
    mode: "git", // "fixed" | "git" | "env"
    fixed: "main",
    sanitize: true
  },

  source: {
    mode: "build", // "static-output" | "crawl" | "content-files" | "build"
    staticOutputDir: "build",
    strictRouteMapping: false,

    // Build mode (recommended for CI/CD)
    build: {
      outputDir: ".svelte-kit/output",
      previewTimeout: 30000,
      exclude: ["/api/*"],
      paramValues: {
        "/blog/[slug]": ["hello-world", "getting-started"]
      },
      discover: false,
      seedUrls: ["/"],
      maxPages: 200,
      maxDepth: 5
    },

    // Crawl mode (alternative)
    crawl: {
      baseUrl: "http://localhost:4173",
      routes: ["/", "/docs", "/blog"],
      sitemapUrl: "https://example.com/sitemap.xml"
    },

    // Content files mode (alternative)
    contentFiles: {
      globs: ["src/routes/**/*.md"],
      baseDir: "."
    }
  },

  extract: {
    mainSelector: "main",
    dropTags: ["header", "nav", "footer", "aside"],
    dropSelectors: [".sidebar", ".toc"],
    ignoreAttr: "data-search-ignore",
    noindexAttr: "data-search-noindex",
    respectRobotsNoindex: true
  },

  chunking: {
    maxChars: 2200,
    overlapChars: 200,
    minChars: 250,
    headingPathDepth: 3,
    dontSplitInside: ["code", "table", "blockquote"],
    prependTitle: true, // prepend page title to chunk text before embedding
    pageSummaryChunk: true // generate synthetic identity chunk per page
  },

  embeddings: {
    provider: "jina",
    model: "jina-embeddings-v5-text-small",
    apiKey: "jina_...", // direct API key (or use apiKeyEnv)
    apiKeyEnv: "JINA_API_KEY",
    batchSize: 64,
    concurrency: 4
  },

  vector: {
    dimension: 1024, // optional, inferred from first embedding
    turso: {
      url: "libsql://my-db.turso.io", // direct URL (or use urlEnv)
      authToken: "eyJhbGc...", // direct token (or use authTokenEnv)
      urlEnv: "TURSO_DATABASE_URL",
      authTokenEnv: "TURSO_AUTH_TOKEN",
      localPath: ".searchsocket/vectors.db"
    }
  },

  rerank: {
    enabled: true,
    topN: 20,
    model: "jina-reranker-v3"
  },

  ranking: {
    enableIncomingLinkBoost: true,
    enableDepthBoost: true,
    pageWeights: {
      "/": 1.1,
      "/docs": 1.15
    },
    minScore: 0,
    aggregationCap: 5,
    aggregationDecay: 0.5,
    minChunkScoreRatio: 0.5,
    weights: {
      incomingLinks: 0.05,
      depth: 0.03,
      rerank: 1.0,
      aggregation: 0.1
    }
  },

  api: {
    path: "/api/search",
    cors: {
      allowOrigins: ["https://example.com"]
    },
    rateLimit: {
      windowMs: 60_000,
      max: 60
    }
  }
};
```

## License

MIT
