https://github.com/superdoc-dev/docx-corpus

The largest open corpus of .docx files for document processing research
https://github.com/superdoc-dev/docx-corpus

bun common-crawl corpus dataset document-processing docx machine-learning nlp typescript word-documents

Last synced: 2 months ago
JSON representation

The largest open corpus of .docx files for document processing research

Host: GitHub
URL: https://github.com/superdoc-dev/docx-corpus
Owner: superdoc-dev
License: mit
Created: 2026-01-09T11:10:21.000Z (5 months ago)
Default Branch: main
Last Pushed: 2026-03-09T21:10:20.000Z (3 months ago)
Last Synced: 2026-03-09T22:46:50.581Z (3 months ago)
Topics: bun, common-crawl, corpus, dataset, document-processing, docx, machine-learning, nlp, typescript, word-documents
Language: TypeScript
Homepage: https://docxcorp.us
Size: 1.24 MB
Stars: 45
Watchers: 2
Forks: 1
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
- Agents: AGENTS.md

Awesome Lists containing this project

README

logo

[![CLI](https://img.shields.io/github/v/release/superdoc-dev/docx-corpus?filter=cli-v*&label=cli)](https://github.com/superdoc-dev/docx-corpus/releases)
[![CDX Filter](https://img.shields.io/github/v/release/superdoc-dev/docx-corpus?filter=cdx-filter-v*&label=cdx-filter)](https://github.com/superdoc-dev/docx-corpus/releases)
[![codecov](https://codecov.io/gh/superdoc-dev/docx-corpus/graph/badge.svg)](https://codecov.io/gh/superdoc-dev/docx-corpus)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

The largest open corpus of classified Word documents. 736K+ `.docx` files from the public web, classified into 10 document types and 9 topics across 46+ languages.

**[docxcorp.us](https://docxcorp.us)** · **[HuggingFace](https://huggingface.co/datasets/superdoc-dev/docx-corpus)** · **[API](https://api.docxcorp.us/stats)**

## How It Works

```
Common Crawl (3B+ URLs/month)
↓
[1. cdx-filter] AWS Lambda — filters CDX indexes for .docx URLs
↓
[2. scrape] Download WARC records, validate, deduplicate, store
↓
[3. extract] Extract text + detect language (Docling + lingua)
↓
[4. classify] Classify by type + topic (ModernBERT, FineWeb-Edu pattern)
↓
[5. export] Push to HuggingFace / serve via API
```

## Quick Start

```bash
git clone https://github.com/superdoc-dev/docx-corpus.git
cd docx-corpus
bun install
```

## CLI

All pipeline stages are accessible through a single CLI:

```bash
corpus cdx-filter # Show available vs filtered crawls
corpus cdx-filter --crawl CC-MAIN-2026-08 # Filter a specific crawl via Lambda
corpus cdx-filter --latest 3 # Filter 3 newest missing crawls
corpus crawls # List available crawls from R2
corpus scrape --crawl CC-MAIN-2025-51 # Scrape a specific crawl
corpus scrape --crawl 3 --batch 100 # Latest 3 crawls, 100 docs each
corpus extract # Extract text from all pending
corpus extract -b 100 -w 8 # Custom batch size + workers
corpus classify # Classify all pending documents
corpus classify --modal --workers 20 # Cloud GPU classification
corpus export # Export parquet locally
corpus export --push # Push to HuggingFace
corpus status # Show full pipeline stats
```

Run `corpus --help` for detailed options.

## Project Structure

```
apps/
cli/ # Unified CLI — corpus
cdx-filter/ # AWS Lambda — filters CDX indexes for .docx URLs
web/ # Landing page (docxcorp.us) + Cloudflare Worker API
packages/
shared/ # DB client, storage abstraction, formatting
scraper/ # Downloads WARC, validates .docx, deduplicates
extractor/ # Text extraction via Docling (Bun + Python)
embedder/ # Document embeddings via Gemini
scripts/
classification/ # ML classification pipeline (Python)
export-hf.py # HuggingFace dataset export
db/
schema.sql # PostgreSQL + pgvector schema
migrations/ # Database migrations
```

| Layer | What | Runtime |
|-------|------|---------|
| **cli** | `corpus` command — orchestrates everything | Bun |
| **cdx-filter** | Filter Common Crawl CDX indexes (Lambda) | Node.js |
| **web** | docxcorp.us landing page + API worker | Static + CF Worker |
| **scraper** | Download, validate, deduplicate .docx files | Bun |
| **extractor** | Extract text + detect language (Docling) | Bun + Python |
| **embedder** | Generate embeddings (Gemini) | Bun |
| **classification** | Type + topic classification (ModernBERT) | Python |

## Pipeline Details

### 1. CDX Filtering (Lambda)

Pre-filters Common Crawl CDX indexes for `.docx` URLs. Runs in AWS Lambda (us-east-1) for direct S3 access — minutes instead of days.

```bash
corpus cdx-filter # Show what's available vs filtered
corpus cdx-filter --crawl CC-MAIN-2026-08 # Filter one crawl
corpus cdx-filter --all # Filter all missing crawls
```

**AWS setup**: The Lambda function needs AWS credentials configured locally. See [apps/cdx-filter/README.md](apps/cdx-filter/README.md) for Lambda deployment.

```bash
# Option 1: AWS CLI profile (recommended)
aws configure --profile docx-corpus
export AWS_PROFILE=docx-corpus

# Option 2: Environment variables
export AWS_ACCESS_KEY_ID=...
export AWS_SECRET_ACCESS_KEY=...
export AWS_REGION=us-east-1
```

The AWS IAM user/role needs `lambda:InvokeFunction` permission on the `cdx-filter` function.

### 2. Scraping

Downloads WARC records from Common Crawl, validates ZIP structure, computes SHA-256 hash, deduplicates, and stores to R2/local filesystem.

```bash
corpus scrape --crawl CC-MAIN-2025-51 --batch 500
corpus scrape --crawl 3 # Latest 3 crawls
corpus scrape --crawl CC-MAIN-2025-51 --force # Re-process existing
```

- Adaptive rate limiting (backs off on 503/429, recovers on success)
- Content-addressed storage (`documents/{sha256}.docx`)
- Deduplication by content hash

### 3. Extraction

Extracts text using Docling (persistent Python subprocess), detects language with lingua.

```bash
corpus extract # All pending documents
corpus extract -b 100 -w 8 # Custom batch + workers
```

- Smart table handling (avoids padding bloat)
- Updates: `word_count`, `char_count`, `table_count`, `image_count`, `language`

### 4. Classification

Classifies documents by **type** (10 classes) and **topic** (9 classes) using the [FineWeb-Edu](https://huggingface.co/spaces/HuggingFaceFW/blogpost-fineweb-v1) pattern: LLM labels a sample → train lightweight classifier → apply at scale.

```bash
corpus classify # Local classification
corpus classify --modal --workers 20 # Cloud GPUs via Modal
corpus classify -l en,ru --batch-size 256 # Filter + custom batch
```

**First-time setup** (training):

```bash
cd scripts/classification
pip install -e .
python sample.py --total 3500 --output sampled_docs.jsonl
python label.py --input sampled_docs.jsonl --output labeled_docs.jsonl
python train.py --input labeled_docs.jsonl --output-dir ./models
```

See [scripts/classification/CLAUDE.md](scripts/classification/CLAUDE.md) for details.

**Document types**: legal, forms, reports, policies, educational, correspondence, technical, administrative, creative, reference

**Topics**: government, education, healthcare, finance, legal_judicial, technology, environment, nonprofit, general

### 5. Export

Export corpus metadata to HuggingFace as a Parquet dataset.

```bash
corpus export # Dry run: local parquet
corpus export --push # Push to HuggingFace
```

### 6. Embedding (optional)

Generate vector embeddings for semantic search. Not required for the website or classification.

```bash
corpus embed # All extracted documents
corpus embed --batch 100 # With batch limit
```

Uses Google Gemini `gemini-embedding-001` (3072 dimensions).

## Web & API

**[docxcorp.us](https://docxcorp.us)** — Browse, filter, and preview documents with SuperDoc.

**API** (Cloudflare Worker):

```bash
# Corpus stats
curl https://api.docxcorp.us/stats

# Search documents with faceted filtering
curl "https://api.docxcorp.us/documents?type=legal&lang=en&min_confidence=0.8"

# Download manifest (wget-compatible URL list)
curl "https://api.docxcorp.us/manifest?type=legal&lang=en" -o manifest.txt
wget -i manifest.txt -P ./corpus/
```

## Configuration

All via environment variables (`.env`):

```bash
# Database (required)
DATABASE_URL=postgres://user:pass@host:5432/dbname

# Cloudflare R2 (required for cloud storage)
CLOUDFLARE_ACCOUNT_ID=
R2_ACCESS_KEY_ID=
R2_SECRET_ACCESS_KEY=
R2_BUCKET_NAME=docx-corpus

# Local storage fallback
STORAGE_PATH=./corpus

# Embeddings (optional)
GOOGLE_API_KEY=

# AWS (for cdx-filter Lambda invocation)
AWS_PROFILE=docx-corpus # or set AWS_ACCESS_KEY_ID + AWS_SECRET_ACCESS_KEY

# Classification (for LLM labeling step only)
ANTHROPIC_API_KEY=
```

## Local Development

```bash
# Start local PostgreSQL + pgvector
docker compose up -d

# Run against local database
DATABASE_URL=postgres://postgres:postgres@localhost:5432/docx_corpus \
bun run corpus status

# Run web API locally
cd apps/web/worker
npx wrangler dev
```

## Docker

```bash
docker build -t docx-corpus .
docker run -e DATABASE_URL=postgres://... docx-corpus scrape --batch 100
```

## Takedown Requests

If you find a document you own and would like removed, email [help@docxcorp.us](mailto:help@docxcorp.us) with the document hash or URL and proof of ownership. Processed within 7 days.

## License

MIT

---

Built by 🦋 [SuperDoc](https://superdoc.dev)

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/superdoc-dev/docx-corpus

Awesome Lists containing this project

README