https://github.com/commoncrawl/cc-vec

Host: GitHub
URL: https://github.com/commoncrawl/cc-vec
Owner: commoncrawl
License: mit
Created: 2025-09-15T16:39:21.000Z (9 months ago)
Default Branch: main
Last Pushed: 2026-01-13T00:05:33.000Z (6 months ago)
Last Synced: 2026-01-13T03:18:51.672Z (6 months ago)
Language: Python
Size: 151 KB
Stars: 5
Watchers: 1
Forks: 2
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

# CCVec - Common Crawl to Vector Stores

Search, analyze, and index Common Crawl data into vector stores for RAG applications. Three interfaces are available:

* CLI

* Python library

* MCP server

## Quick Start

**Environment variables:**

- **`ATHENA_OUTPUT_BUCKET`** - Required. S3 bucket where Athena writes query results (needed for reliable queries against Common Crawl metadata)

- **`AWS_ACCESS_KEY_ID`** - Required for Athena/S3 access (needed to run Athena queries)

- **`AWS_SECRET_ACCESS_KEY`** - Required for Athena/S3 access (needed to run Athena queries)

- **`AWS_SESSION_TOKEN`** - Optional for Athena/S3 access; required only when using temporary credentials

- **`OPENAI_API_KEY`** - Required for vector operations (index, query, list)

- `OPENAI_BASE_URL` - Optional custom OpenAI endpoint (e.g., `http://localhost:8321/v1` for Llama Stack)

- `OPENAI_EMBEDDING_MODEL` - Embedding model to use (e.g., `text-embedding-3-small`, `ollama/nomic-embed-text:latest`)

- `OPENAI_EMBEDDING_DIMENSIONS` - Embedding dimensions (optional, model-specific)

- `AWS_DEFAULT_REGION` - AWS region (defaults to us-west-2)

- `LOG_LEVEL` - Logging level (defaults to INFO)
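A quick pre-flight check can catch missing variables before any Athena or embedding call is made. The sketch below is not part of cc-vec: `missing_env_vars` and the two tuples are hypothetical names, and the grouping simply follows the "Required" labels in the list above.

```python
import os

# Variable names come from the list above; the grouping is this sketch's assumption.
ATHENA_REQUIRED = ("ATHENA_OUTPUT_BUCKET", "AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")
VECTOR_REQUIRED = ("OPENAI_API_KEY",)


def missing_env_vars(names, environ=None):
    """Return the names from `names` that are unset or empty in `environ`."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars(ATHENA_REQUIRED + VECTOR_REQUIRED)
    if missing:
        print("Missing environment variables: " + ", ".join(missing))
    else:
        print("All required variables are set.")
```

Passing a plain dict as `environ` makes the helper easy to exercise without touching the real environment.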

**Note:** URL matching uses SQL wildcards (`%`), not glob patterns (`*`).
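To make the difference concrete, the sketch below converts a simple glob into a `LIKE` pattern and evaluates it with an in-memory SQLite database. Both `glob_to_like` and `like_matches` are hypothetical helpers, not cc-vec functions. SQLite's `LIKE` is case-insensitive for ASCII by default, so it only approximates how Athena would evaluate the same pattern.

```python
import sqlite3


def glob_to_like(pattern):
    """Convert a simple glob (`*`, `?`) to a SQL LIKE pattern (`%`, `_`).

    Any `%` or `_` already in the input is left alone, so it still acts
    as a wildcard under LIKE.
    """
    return pattern.replace("*", "%").replace("?", "_")


def like_matches(value, like_pattern):
    """Evaluate `value LIKE like_pattern` using an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    try:
        row = conn.execute("SELECT ? LIKE ?", (value, like_pattern)).fetchone()
        return bool(row[0])
    finally:
        conn.close()


pattern = glob_to_like("*.github.io")
print(pattern)
print(like_matches("user.github.io", pattern))        # SQL wildcard matches
print(like_matches("user.github.io", "*.github.io"))  # glob star is a literal here
```

Under `LIKE`, `*` is an ordinary character, so an unconverted glob only matches URLs that contain a literal asterisk.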

## 1. ⌨️ Command Line

```bash
# Search Common Crawl index
uv run cc-vec search --url-patterns "%.github.io" --limit 10

# Get statistics
uv run cc-vec stats --url-patterns "%.edu"

# Fetch and process content (returns clean text)
uv run cc-vec fetch --url-patterns "%.example.com" --limit 5

# Advanced filtering - multiple filters can be combined
uv run cc-vec fetch --url-patterns "%.github.io" --status-codes "200,201" --mime-types "text/html" --limit 10

# Filter by hostname instead of pattern
uv run cc-vec search --url-host-names "github.io,github.com" --limit 10

# Filter by TLD for better performance (uses indexed column)
uv run cc-vec search --url-host-tlds "edu,gov" --limit 20

# Filter by registered domain (uses indexed column)
uv run cc-vec search --url-host-registered-domains "github.com,example.com" --limit 15

# Filter by URL path (for specific site sections)
uv run cc-vec search --url-host-names "github.io" --url-paths "/blog/%,/docs/%" --limit 10

# Query across multiple Common Crawl datasets
uv run cc-vec search --url-patterns "%.edu" --crawl-ids "CC-MAIN-2024-33,CC-MAIN-2024-30" --limit 20

# List available Common Crawl datasets
uv run cc-vec list-crawls

# List all available filter columns (no API keys needed)
uv run cc-vec list-filter-columns
uv run cc-vec list-filter-columns --output json

# Vector operations (require OPENAI_API_KEY)

# Create vector store with processed content (OpenAI handles chunking with token limits)
uv run cc-vec index --url-patterns "%.github.io" --vector-store-name "ml-research" --limit 50 --chunk-size 800 --overlap 400

# Vector store name is optional - will auto-generate if not provided
uv run cc-vec index --url-patterns "%.github.io" --limit 50

# List cc-vec vector stores (default - only shows stores created by cc-vec)
uv run cc-vec list --output json

# List ALL vector stores (including non-cc-vec stores)
uv run cc-vec list --all

# Query vector store by ID for RAG
uv run cc-vec query "What is machine learning?" --vector-store-id "vs-123abc" --limit 5

# Query vector store by name
uv run cc-vec query "Explain deep learning" --vector-store-name "ml-research" --limit 3
```
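When the CLI is driven from a script, building the argument list programmatically avoids shell-quoting mistakes around `%`. The following is a minimal sketch: `build_cc_vec_command` is a hypothetical helper, and it assumes every filter flag accepts comma-joined values the way `--url-host-names` and `--crawl-ids` do in the examples above.

```python
import shlex


def build_cc_vec_command(subcommand, limit=None, **filters):
    """Build an argv list for `uv run cc-vec <subcommand>`.

    Keyword names map to flags by replacing underscores with hyphens,
    e.g. url_host_tlds -> --url-host-tlds. List values are comma-joined.
    """
    argv = ["uv", "run", "cc-vec", subcommand]
    for name, value in filters.items():
        if isinstance(value, (list, tuple)):
            value = ",".join(str(v) for v in value)
        argv += ["--" + name.replace("_", "-"), str(value)]
    if limit is not None:
        argv += ["--limit", str(limit)]
    return argv


cmd = build_cc_vec_command("search", url_host_tlds=["edu", "gov"], limit=20)
print(shlex.join(cmd))  # shell-safe rendering of the command, for logging
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` without going through a shell.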

## 1.5. Local Llama Stack Setup (Optional)

Run cc-vec with local models using Ollama + Llama Stack. Embeddings and vector storage then run entirely on your machine; Athena queries still need the AWS credentials shown in Step 4.

**Step 1: Install and Start Ollama**

```bash
# Install Ollama (macOS/Linux)
curl -fsSL https://ollama.com/install.sh | sh

# Start Ollama server
ollama serve &

# Pull required models
ollama pull llama3.2:3b                        # Inference model
ollama pull nomic-embed-text                    # Embedding model (768 dimensions)
```

**Step 2: Start ChromaDB (Optional - for persistent vector storage)**

The starter distribution uses in-memory FAISS by default. For persistent storage, run ChromaDB:

```bash
# Install and run ChromaDB
uv run --with chromadb chroma run --host localhost --port 8000 --path ./chroma_data
```

**Step 3: Start Llama Stack**

```bash
# With explicit Ollama URL (and FAISS in-memory vector store)
uv run --with 'llama-stack>=0.4.0' llama stack run starter --port 8321 \
  --env OLLAMA_URL=http://localhost:11434

# With ChromaDB for persistent vector storage (if running ChromaDB from Step 2)
uv run --with 'llama-stack>=0.4.0' llama stack run starter --port 8321 \
  --env OLLAMA_URL=http://localhost:11434 \
  --env CHROMADB_URL=http://localhost:8000
```

This starts the Llama Stack server at `http://localhost:8321`.
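Before pointing cc-vec at the local stack, it can help to confirm that each service is actually listening. The sketch below only tests whether the TCP ports accept connections. It does not call any Ollama, ChromaDB, or Llama Stack API, and `port_is_open` is a hypothetical helper name.

```python
import socket


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to (host, port) succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Default ports used in the steps above.
    for name, port in [("Ollama", 11434), ("ChromaDB", 8000), ("Llama Stack", 8321)]:
        state = "reachable" if port_is_open("localhost", port) else "not reachable"
        print(f"{name} on port {port}: {state}")
```

An open port shows only that something is listening there, not that the service is healthy or configured correctly.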

**Step 4: Use with cc-vec**

```bash
# Set environment variables
export OPENAI_BASE_URL=http://localhost:8321/v1
export OPENAI_API_KEY=none # Llama Stack doesn't require a real key
export OPENAI_EMBEDDING_MODEL=ollama/nomic-embed-text:latest
export OPENAI_EMBEDDING_DIMENSIONS=768

# Set your Athena credentials
export ATHENA_OUTPUT_BUCKET=s3://your-bucket/
export AWS_ACCESS_KEY_ID=your-key
export AWS_SECRET_ACCESS_KEY=your-secret

# Use cc-vec with local models
uv run cc-vec index --url-patterns "%.edu" --limit 10
```

**Documentation:**

- [Llama Stack Docs](https://llamastack.github.io/)

- [Llama Stack GitHub](https://github.com/meta-llama/llama-stack)

- [Ollama Models](https://ollama.com/library)

- [ChromaDB Docs](https://docs.trychroma.com/)

## 2. 📦 Python Library

```python
import os

# For alternative endpoints, set environment variables before importing cc_vec
# Example: Using Ollama
# os.environ["OPENAI_BASE_URL"] = "http://localhost:11434/v1"
# os.environ["OPENAI_API_KEY"] = "ollama"
# os.environ["OPENAI_EMBEDDING_MODEL"] = "ollama/nomic-embed-text:latest"
# os.environ["OPENAI_EMBEDDING_DIMENSIONS"] = "768"

# Example: Using Llama Stack
# os.environ["OPENAI_BASE_URL"] = "http://localhost:8321/v1"
# os.environ["OPENAI_API_KEY"] = "your-llama-stack-key"

from cc_vec import (
    search,
    stats,
    fetch,
    index,
    list_vector_stores,
    query_vector_store,
    list_crawls,
    FilterConfig,
    VectorStoreConfig,
)

# Basic search and stats (no OpenAI key needed)
filter_config = FilterConfig(url_patterns=["%.github.io"])

stats_response = stats(filter_config)
print(f"Estimated records: {stats_response.estimated_records:,}")
print(f"Estimated size: {stats_response.estimated_size_mb:.2f} MB")
print(f"Athena cost: ${stats_response.estimated_cost_usd:.4f}")

results = search(filter_config, limit=10)
print(f"Found {len(results)} URLs")
for result in results[:3]:
    print(f"  {result.url} (Status: {result.status})")

# Advanced filtering - multiple criteria
filter_config = FilterConfig(
    url_patterns=["%.github.io", "%.github.com"],
    url_host_names=["github.io"],
    url_host_tlds=["io", "com"],  # Filter by TLD (uses indexed column)
    url_host_registered_domains=["github.com"],  # Filter by domain (uses indexed column)
    url_paths=["/blog/%", "/docs/%"],  # Filter by URL path
    crawl_ids=["CC-MAIN-2024-33", "CC-MAIN-2024-30"],  # Query multiple crawls
    status_codes=[200, 201],
    mime_types=["text/html"],
    charsets=["utf-8"],
    languages=["en"],
)

results = search(filter_config, limit=20)
print(f"Found {len(results)} URLs matching filters")

# Using indexed columns for better performance
filter_config = FilterConfig(
    url_host_tlds=["edu", "gov"],  # Much faster than url_patterns=["%.edu", "%.gov"]
    status_codes=[200],
)

results = search(filter_config, limit=50)
print(f"Found {len(results)} .edu and .gov sites")

# Fetch and process content (returns clean text)
filter_config = FilterConfig(url_patterns=["%.example.com"])

content_results = fetch(filter_config, limit=2)
print(f"Processed {len(content_results)} content records")
for record, processed in content_results:
    if processed:
        print(f"  {record.url}: {processed['word_count']} words")
        print(f"    Title: {processed.get('title', 'N/A')}")

# List available Common Crawl datasets
crawls = list_crawls()
print(f"Available crawls: {len(crawls)}")
print(f"Latest: {crawls[0]}")

# Index data in a vector store
filter_config = FilterConfig(url_patterns=["%.github.io"])

vector_config = VectorStoreConfig(
    name="ml-research",
    chunk_size=800,
    overlap=400,
    embedding_model="text-embedding-3-small",
    embedding_dimensions=1536,
)

result = index(filter_config, vector_config, limit=50)
print(f"Created vector store: {result['vector_store_name']}")
print(f"Vector Store ID: {result['vector_store_id']}")
print(f"Processed records: {result['total_fetched']}")

# List cc-vec vector stores (default - only shows stores created by cc-vec)
stores = list_vector_stores()
print(f"Available stores: {len(stores)}")
for store in stores[:3]:
    print(f"  {store['name']} (ID: {store['id']}, Status: {store['status']})")

# List ALL vector stores (including non-cc-vec stores)
all_stores = list_vector_stores(cc_vec_only=False)
print(f"All stores: {len(all_stores)}")

# Query vector store for RAG
query_results = query_vector_store("vs-123abc", "What is machine learning?", limit=5)
print(f"Query found {len(query_results.get('results', []))} relevant results")
for i, result in enumerate(query_results.get("results", []), 1):
    print(f"  {i}. Score: {result.get('score', 0):.3f}")
    print(f"     Content: {result.get('content', '')[:100]}...")
    print(f"     File: {result.get('file_id', 'N/A')}")
```
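The `chunk_size` and `overlap` settings are easier to reason about with a small illustration. The sketch below is not how cc-vec or the vector store backend chunks text (that happens server-side). It shows the usual sliding-window reading of the two parameters, applied to a plain list standing in for tokens. `sliding_chunks` is a hypothetical helper.

```python
def sliding_chunks(tokens, chunk_size=800, overlap=400):
    """Split `tokens` into windows of `chunk_size` that share `overlap` items with the previous window."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end
    return chunks


tokens = list(range(2000))  # stand-in for a tokenized document
chunks = sliding_chunks(tokens, chunk_size=800, overlap=400)
print(len(chunks), [(c[0], c[-1]) for c in chunks])
```

With an overlap of half the chunk size, every token except those near the document edges lands in two chunks, which roughly doubles the number of embeddings stored per document.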

## 3. 🔌 MCP Server (Claude Desktop)

**Setup:**

1. Copy and edit the config: `cp claude_desktop_config.json ~/Library/Application\ Support/Claude/claude_desktop_config.json`

2. Update the directory path and API key in the config file

3. Restart Claude Desktop

The config uses stdio mode (required by Claude Desktop). To target Ollama, Llama Stack, or another OpenAI-compatible endpoint, add the optional `OPENAI_BASE_URL` (e.g., `http://localhost:11434/v1`), `OPENAI_EMBEDDING_MODEL` (e.g., `ollama/nomic-embed-text:latest`), and `OPENAI_EMBEDDING_DIMENSIONS` (e.g., `768`) keys to `env`. They are omitted here because JSON does not allow comments:

```json
{
  "mcpServers": {
    "cc-vec": {
      "command": "uv",
      "args": ["run", "--directory", "your-path-to-the-repo", "cc-vec", "mcp-serve", "--mode", "stdio"],
      "env": {
        "ATHENA_OUTPUT_BUCKET": "your-athena-output-bucket",
        "OPENAI_API_KEY": "your-openai-api-key-here"
      }
    }
  }
}
```
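Generating the file from a script is one way to keep it strictly valid JSON while toggling the optional endpoint settings. This is a minimal sketch mirroring the structure shown above. `build_claude_config` is a hypothetical helper, and the placeholder values must be replaced with real paths and keys.

```python
import json


def build_claude_config(repo_dir, athena_bucket, openai_api_key, extra_env=None):
    """Return the Claude Desktop config as a dict, mirroring the structure shown above."""
    env = {
        "ATHENA_OUTPUT_BUCKET": athena_bucket,
        "OPENAI_API_KEY": openai_api_key,
    }
    env.update(extra_env or {})  # optional endpoint and embedding settings
    return {
        "mcpServers": {
            "cc-vec": {
                "command": "uv",
                "args": ["run", "--directory", repo_dir, "cc-vec", "mcp-serve", "--mode", "stdio"],
                "env": env,
            }
        }
    }


config = build_claude_config(
    "your-path-to-the-repo",
    "your-athena-output-bucket",
    "none",
    extra_env={
        "OPENAI_BASE_URL": "http://localhost:8321/v1",
        "OPENAI_EMBEDDING_MODEL": "ollama/nomic-embed-text:latest",
        "OPENAI_EMBEDDING_DIMENSIONS": "768",
    },
)
print(json.dumps(config, indent=2))
```

The printed output can be redirected into `claude_desktop_config.json`; environment values stay strings, matching the shape of the `env` object above.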

**Available MCP tools:**

```
# Search and analysis (no OpenAI key needed)
cc_search - Search Common Crawl for URLs matching patterns with advanced filtering
cc_stats - Get statistics and cost estimates for patterns with advanced filtering
cc_fetch - Download actual content from matched URLs with advanced filtering
cc_list_crawls - List available Common Crawl dataset IDs

# Vector operations (require OPENAI_API_KEY)
cc_index - Create and populate vector stores from Common Crawl content with chunking config
cc_list_vector_stores - List OpenAI vector stores (defaults to cc-vec created only)
cc_query - Query vector stores for relevant content
```

**Example usage in Claude Desktop:**

- "Use cc_search to find GitHub Pages sites: url_pattern=%.github.io, limit=10"

- "Use cc_stats to analyze education sites: url_pattern=%.edu"

- "Use cc_search with indexed columns for better performance: url_host_tlds=['edu', 'gov'], limit=20"

- "Use cc_search with registered domains: url_host_registered_domains=['github.com'], limit=15"

- "Use cc_search for specific paths: url_host_names=['github.io'], url_paths=['/blog/%'], limit=10"

- "Use cc_search across multiple crawls: url_pattern=%.edu, crawl_ids=['CC-MAIN-2024-33', 'CC-MAIN-2024-30']"

- "Use cc_fetch to get content: url_host_names=['github.io'], limit=5"

- "Use cc_list_crawls to show available Common Crawl datasets"

- "Use cc_index to create vector store: vector_store_name=research, url_pattern=%.arxiv.org, limit=100, chunk_size=800"

- "Use cc_list_vector_stores to show cc-vec stores (default)"

- "Use cc_list_vector_stores with cc_vec_only=false to show all vector stores"

- "Use cc_query to search: vector_store_id=vs-123, query=machine learning"

**Note:** All filter options available in CLI (shown via `cc-vec list-filter-columns`) are also available in MCP tools.

## License

MIT