{"id":36693547,"url":"https://github.com/superdoc-dev/docx-corpus","last_synced_at":"2026-03-12T11:10:48.619Z","repository":{"id":332072096,"uuid":"1131026593","full_name":"superdoc-dev/docx-corpus","owner":"superdoc-dev","description":"The largest open corpus of .docx files for document processing research","archived":false,"fork":false,"pushed_at":"2026-03-09T21:10:20.000Z","size":1303,"stargazers_count":45,"open_issues_count":0,"forks_count":1,"subscribers_count":2,"default_branch":"main","last_synced_at":"2026-03-09T22:46:50.581Z","etag":null,"topics":["bun","common-crawl","corpus","dataset","document-processing","docx","machine-learning","nlp","typescript","word-documents"],"latest_commit_sha":null,"homepage":"https://docxcorp.us","language":"TypeScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/superdoc-dev.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":"AGENTS.md","dco":null,"cla":null}},"created_at":"2026-01-09T11:10:21.000Z","updated_at":"2026-03-09T21:10:17.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/superdoc-dev/docx-corpus","commit_stats":null,"previous_names":["superdoc-dev/docx-corpus"],"tags_count":96,"template":false,"template_full_name":null,"purl":"pkg:github/superdoc-dev/docx-corpus","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/superdoc-dev%2Fdocx-corpus","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/superdoc-dev%2Fdocx-corpus/tags","releases_url"
:"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/superdoc-dev%2Fdocx-corpus/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/superdoc-dev%2Fdocx-corpus/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/superdoc-dev","download_url":"https://codeload.github.com/superdoc-dev/docx-corpus/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/superdoc-dev%2Fdocx-corpus/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":30316096,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-03-09T20:05:46.299Z","status":"ssl_error","status_checked_at":"2026-03-09T19:57:04.425Z","response_time":61,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bun","common-crawl","corpus","dataset","document-processing","docx","machine-learning","nlp","typescript","word-documents"],"created_at":"2026-01-12T11:24:42.231Z","updated_at":"2026-03-12T11:10:46.217Z","avatar_url":"https://github.com/superdoc-dev.png","language":"TypeScript","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cimg width=\"400\" alt=\"logo\" src=\"https://github.com/user-attachments/assets/ea105e9e-00d0-4d48-a2a4-006cc4e89848\" 
/\u003e\n\n[![CLI](https://img.shields.io/github/v/release/superdoc-dev/docx-corpus?filter=cli-v*\u0026label=cli)](https://github.com/superdoc-dev/docx-corpus/releases)\n[![CDX Filter](https://img.shields.io/github/v/release/superdoc-dev/docx-corpus?filter=cdx-filter-v*\u0026label=cdx-filter)](https://github.com/superdoc-dev/docx-corpus/releases)\n[![codecov](https://codecov.io/gh/superdoc-dev/docx-corpus/graph/badge.svg)](https://codecov.io/gh/superdoc-dev/docx-corpus)\n[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)\n\nThe largest open corpus of classified Word documents. 736K+ `.docx` files from the public web, classified into 10 document types and 9 topics across 46+ languages.\n\n**[docxcorp.us](https://docxcorp.us)** · **[HuggingFace](https://huggingface.co/datasets/superdoc-dev/docx-corpus)** · **[API](https://api.docxcorp.us/stats)**\n\n## How It Works\n\n```\nCommon Crawl (3B+ URLs/month)\n    ↓\n[1. cdx-filter]  AWS Lambda — filters CDX indexes for .docx URLs\n    ↓\n[2. scrape]      Download WARC records, validate, deduplicate, store\n    ↓\n[3. extract]     Extract text + detect language (Docling + lingua)\n    ↓\n[4. classify]    Classify by type + topic (ModernBERT, FineWeb-Edu pattern)\n    ↓\n[5. 
export]      Push to HuggingFace / serve via API\n```\n\n## Quick Start\n\n```bash\ngit clone https://github.com/superdoc-dev/docx-corpus.git\ncd docx-corpus\nbun install\n```\n\n## CLI\n\nAll pipeline stages are accessible through a single CLI:\n\n```bash\ncorpus cdx-filter                         # Show available vs filtered crawls\ncorpus cdx-filter --crawl CC-MAIN-2026-08 # Filter a specific crawl via Lambda\ncorpus cdx-filter --latest 3              # Filter 3 newest missing crawls\ncorpus crawls                              # List available crawls from R2\ncorpus scrape --crawl CC-MAIN-2025-51      # Scrape a specific crawl\ncorpus scrape --crawl 3 --batch 100        # Latest 3 crawls, 100 docs each\ncorpus extract                             # Extract text from all pending\ncorpus extract -b 100 -w 8                 # Custom batch size + workers\ncorpus classify                            # Classify all pending documents\ncorpus classify --modal --workers 20       # Cloud GPU classification\ncorpus export                              # Export parquet locally\ncorpus export --push                       # Push to HuggingFace\ncorpus status                              # Show full pipeline stats\n```\n\nRun `corpus \u003ccommand\u003e --help` for detailed options.\n\n## Project Structure\n\n```\napps/\n  cli/              # Unified CLI — corpus \u003ccommand\u003e\n  cdx-filter/       # AWS Lambda — filters CDX indexes for .docx URLs\n  web/              # Landing page (docxcorp.us) + Cloudflare Worker API\npackages/\n  shared/           # DB client, storage abstraction, formatting\n  scraper/          # Downloads WARC, validates .docx, deduplicates\n  extractor/        # Text extraction via Docling (Bun + Python)\n  embedder/         # Document embeddings via Gemini\nscripts/\n  classification/   # ML classification pipeline (Python)\n  export-hf.py      # HuggingFace dataset export\ndb/\n  schema.sql        # PostgreSQL + pgvector schema\n  migrations/       
# Database migrations\n```\n\n| Layer | What | Runtime |\n|-------|------|---------|\n| **cli** | `corpus` command — orchestrates everything | Bun |\n| **cdx-filter** | Filter Common Crawl CDX indexes (Lambda) | Node.js |\n| **web** | docxcorp.us landing page + API worker | Static + CF Worker |\n| **scraper** | Download, validate, deduplicate .docx files | Bun |\n| **extractor** | Extract text + detect language (Docling) | Bun + Python |\n| **embedder** | Generate embeddings (Gemini) | Bun |\n| **classification** | Type + topic classification (ModernBERT) | Python |\n\n## Pipeline Details\n\n### 1. CDX Filtering (Lambda)\n\nPre-filters Common Crawl CDX indexes for `.docx` URLs. Runs in AWS Lambda (us-east-1) for direct S3 access — minutes instead of days.\n\n```bash\ncorpus cdx-filter                          # Show what's available vs filtered\ncorpus cdx-filter --crawl CC-MAIN-2026-08  # Filter one crawl\ncorpus cdx-filter --all                    # Filter all missing crawls\n```\n\n**AWS setup**: The Lambda function needs AWS credentials configured locally. See [apps/cdx-filter/README.md](apps/cdx-filter/README.md) for Lambda deployment.\n\n```bash\n# Option 1: AWS CLI profile (recommended)\naws configure --profile docx-corpus\nexport AWS_PROFILE=docx-corpus\n\n# Option 2: Environment variables\nexport AWS_ACCESS_KEY_ID=...\nexport AWS_SECRET_ACCESS_KEY=...\nexport AWS_REGION=us-east-1\n```\n\nThe AWS IAM user/role needs `lambda:InvokeFunction` permission on the `cdx-filter` function.\n\n### 2. 
Scraping\n\nDownloads WARC records from Common Crawl, validates ZIP structure, computes SHA-256 hash, deduplicates, and stores to R2/local filesystem.\n\n```bash\ncorpus scrape --crawl CC-MAIN-2025-51 --batch 500\ncorpus scrape --crawl 3                  # Latest 3 crawls\ncorpus scrape --crawl CC-MAIN-2025-51 --force  # Re-process existing\n```\n\n- Adaptive rate limiting (backs off on 503/429, recovers on success)\n- Content-addressed storage (`documents/{sha256}.docx`)\n- Deduplication by content hash\n\n### 3. Extraction\n\nExtracts text using Docling (persistent Python subprocess), detects language with lingua.\n\n```bash\ncorpus extract                    # All pending documents\ncorpus extract -b 100 -w 8       # Custom batch + workers\n```\n\n- Smart table handling (avoids padding bloat)\n- Updates: `word_count`, `char_count`, `table_count`, `image_count`, `language`\n\n### 4. Classification\n\nClassifies documents by **type** (10 classes) and **topic** (9 classes) using the [FineWeb-Edu](https://huggingface.co/spaces/HuggingFaceFW/blogpost-fineweb-v1) pattern: LLM labels a sample → train lightweight classifier → apply at scale.\n\n```bash\ncorpus classify                            # Local classification\ncorpus classify --modal --workers 20       # Cloud GPUs via Modal\ncorpus classify -l en,ru --batch-size 256  # Filter + custom batch\n```\n\n**First-time setup** (training):\n\n```bash\ncd scripts/classification\npip install -e .\npython sample.py --total 3500 --output sampled_docs.jsonl\npython label.py --input sampled_docs.jsonl --output labeled_docs.jsonl\npython train.py --input labeled_docs.jsonl --output-dir ./models\n```\n\nSee [scripts/classification/CLAUDE.md](scripts/classification/CLAUDE.md) for details.\n\n**Document types**: legal, forms, reports, policies, educational, correspondence, technical, administrative, creative, reference\n\n**Topics**: government, education, healthcare, finance, legal_judicial, technology, environment, nonprofit, 
general\n\n### 5. Export\n\nExport corpus metadata to HuggingFace as a Parquet dataset.\n\n```bash\ncorpus export                    # Dry run: local parquet\ncorpus export --push             # Push to HuggingFace\n```\n\n### 6. Embedding (optional)\n\nGenerate vector embeddings for semantic search. Not required for the website or classification.\n\n```bash\ncorpus embed                     # All extracted documents\ncorpus embed --batch 100         # With batch limit\n```\n\nUses Google Gemini `gemini-embedding-001` (3072 dimensions).\n\n## Web \u0026 API\n\n**[docxcorp.us](https://docxcorp.us)** — Browse, filter, and preview documents with SuperDoc.\n\n**API** (Cloudflare Worker):\n\n```bash\n# Corpus stats\ncurl https://api.docxcorp.us/stats\n\n# Search documents with faceted filtering\ncurl \"https://api.docxcorp.us/documents?type=legal\u0026lang=en\u0026min_confidence=0.8\"\n\n# Download manifest (wget-compatible URL list)\ncurl \"https://api.docxcorp.us/manifest?type=legal\u0026lang=en\" -o manifest.txt\nwget -i manifest.txt -P ./corpus/\n```\n\n## Configuration\n\nAll via environment variables (`.env`):\n\n```bash\n# Database (required)\nDATABASE_URL=postgres://user:pass@host:5432/dbname\n\n# Cloudflare R2 (required for cloud storage)\nCLOUDFLARE_ACCOUNT_ID=\nR2_ACCESS_KEY_ID=\nR2_SECRET_ACCESS_KEY=\nR2_BUCKET_NAME=docx-corpus\n\n# Local storage fallback\nSTORAGE_PATH=./corpus\n\n# Embeddings (optional)\nGOOGLE_API_KEY=\n\n# AWS (for cdx-filter Lambda invocation)\nAWS_PROFILE=docx-corpus  # or set AWS_ACCESS_KEY_ID + AWS_SECRET_ACCESS_KEY\n\n# Classification (for LLM labeling step only)\nANTHROPIC_API_KEY=\n```\n\n## Local Development\n\n```bash\n# Start local PostgreSQL + pgvector\ndocker compose up -d\n\n# Run against local database\nDATABASE_URL=postgres://postgres:postgres@localhost:5432/docx_corpus \\\n  bun run corpus status\n\n# Run web API locally\ncd apps/web/worker\nnpx wrangler dev\n```\n\n## Docker\n\n```bash\ndocker build -t docx-corpus 
.\ndocker run -e DATABASE_URL=postgres://... docx-corpus scrape --batch 100\n```\n\n## Takedown Requests\n\nIf you find a document that you own and would like it removed, email [help@docxcorp.us](mailto:help@docxcorp.us) with the document hash or URL and proof of ownership. Requests are processed within 7 days.\n\n## License\n\nMIT\n\n---\n\nBuilt by 🦋 [SuperDoc](https://superdoc.dev)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsuperdoc-dev%2Fdocx-corpus","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsuperdoc-dev%2Fdocx-corpus","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsuperdoc-dev%2Fdocx-corpus/lists"}