{"id":49354361,"url":"https://github.com/bravo1goingdark/ucfp","last_synced_at":"2026-04-27T12:06:24.530Z","repository":{"id":321664287,"uuid":"1085944958","full_name":"bravo1goingdark/ucfp","owner":"bravo1goingdark","description":"UCFP is a high-performance, multimodal content fingerprinting framework written in Rust","archived":false,"fork":false,"pushed_at":"2026-03-30T18:08:25.000Z","size":1499,"stargazers_count":6,"open_issues_count":2,"forks_count":1,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-03-30T20:14:09.738Z","etag":null,"topics":["ann","canonicalization","content-matching","data-fingerprints","fingerprinting","hashing","indexing","ml-pipeline","nlp","perceptual-hashing","redb","semantic-analysis","similarity-search","unicode-normalization"],"latest_commit_sha":null,"homepage":"https://bravo1goingdark.github.io/ucfp/","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/bravo1goingdark.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-10-29T18:15:41.000Z","updated_at":"2026-03-30T18:08:29.000Z","dependencies_parsed_at":"2025-10-30T22:08:03.254Z","dependency_job_id":"11df3109-2efe-4167-95e6-634ac1ae722b","html_url":"https://github.com/bravo1goingdark/ucfp","commit_stats":null,"previous_names":["bravo1goingdark/ucfp"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/bravo1goingdark/ucfp","repository_url":"https://repos.ecosyste.ms/api/v1
/hosts/GitHub/repositories/bravo1goingdark%2Fucfp","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bravo1goingdark%2Fucfp/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bravo1goingdark%2Fucfp/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bravo1goingdark%2Fucfp/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/bravo1goingdark","download_url":"https://codeload.github.com/bravo1goingdark/ucfp/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bravo1goingdark%2Fucfp/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32335383,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-26T23:26:28.701Z","status":"online","status_checked_at":"2026-04-27T02:00:06.769Z","response_time":128,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ann","canonicalization","content-matching","data-fingerprints","fingerprinting","hashing","indexing","ml-pipeline","nlp","perceptual-hashing","redb","semantic-analysis","similarity-search","unicode-normalization"],"created_at":"2026-04-27T12:06:23.802Z","updated_at":"2026-04-27T12:06:24.522Z","avatar_url":"https://github.com/bravo1goingdark.png","language":"Rust","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cdiv align=\"center\"\u003e\n\n# Universal Content Fingerprinting 
(UCFP)\n\n**Deterministic, reproducible content fingerprints for text, audio, image, video, and documents**\n\n[![Rust](https://img.shields.io/badge/rust-%23000000.svg?style=for-the-badge\u0026logo=rust\u0026logoColor=white)](https://www.rust-lang.org/)\n[![CI](https://img.shields.io/github/actions/workflow/status/bravo1goingdark/ucfp/ci.yml?style=for-the-badge\u0026label=CI)](https://github.com/bravo1goingdark/ucfp/actions)\n[![License](https://img.shields.io/badge/license-Apache--2.0-blue.svg?style=for-the-badge)](LICENSE)\n[![GitHub stars](https://img.shields.io/github/stars/bravo1goingdark/ucfp?style=for-the-badge\u0026logo=github\u0026color=yellow)](https://github.com/bravo1goingdark/ucfp/stargazers)\n\n\u003c/div\u003e\n\nUCFP is a Rust framework that unifies **exact hashing**, **perceptual similarity**, and **semantic embeddings** into a single pipeline.\n\nTraditional hashes fail when content changes slightly. Semantic search requires understanding beyond byte matching. UCFP gives you both—exact matches and meaning-based similarity—in one deterministic pipeline.\n\n- **Deduplication** — Find exact and near-duplicate content\n- **Plagiarism Detection** — Identify paraphrased text\n- **Content Provenance** — Track content across systems\n- **Similarity Search** — Search by meaning, not just keywords\n\n## Quickstart\n\n**Prerequisites**: Rust 1.76+ (`rustup toolchain install stable`)\n\n```bash\n# Build \u0026 test\ncargo test --all\n\n# Run examples\ncargo run --example full_pipeline          # complete pipeline\ncargo run --example pipeline_metrics       # with observability\ncargo run --package perceptual --example fingerprint_demo\n```\n\n## Usage\n\n```rust\nuse ucfp::{\n    CanonicalizeConfig, IngestConfig, IngestPayload, IngestSource,\n    PerceptualConfig, RawIngestRecord, PipelineStageConfig, process_pipeline,\n};\n\nlet record = RawIngestRecord {\n    id: \"demo\".into(),\n    source: IngestSource::RawText,\n    payload: 
Some(IngestPayload::Text(\"Hello world\".into())),\n    ..Default::default()\n};\n\nlet (doc, fingerprint, _) = process_pipeline(\n    record,\n    PipelineStageConfig::Perceptual,\n    \u0026IngestConfig::default(),\n    \u0026CanonicalizeConfig::default(),\n    Some(\u0026PerceptualConfig::default()),\n    None,\n)?;\n\nprintln!(\"Canonical hash: {}\", doc.canonical_hash);\nprintln!(\"MinHash bands: {}\", fingerprint.unwrap().minhash_bands.len());\n```\n\nSee [`examples/`](examples/) for full pipeline demonstrations.\n\n## Full Pipeline Example\n\nComplete workflow from ingest to matching:\n\n```rust\nuse ucfp::{\n    CanonicalizeConfig, IngestConfig, IngestMetadata, IngestPayload, IngestSource,\n    PerceptualConfig, RawIngestRecord, SemanticConfig, PipelineStageConfig,\n    process_pipeline,\n};\nuse ucfp_index::{BackendConfig, IndexConfig, IndexRecord, UfpIndex};\nuse ucfp_matcher::{Matcher, MatchConfig, MatchRequest};\n\n// 1. Configure all stages\nlet ingest_cfg = IngestConfig::default();\nlet canonical_cfg = CanonicalizeConfig::default();\nlet perceptual_cfg = PerceptualConfig::default();\nlet semantic_cfg = SemanticConfig::default();\n\n// 2. Create index\nlet index_cfg = IndexConfig::new().with_backend(BackendConfig::InMemory);\nlet index = UfpIndex::new(index_cfg).unwrap();\n\n// 3. Ingest a document\nlet record = RawIngestRecord {\n    id: \"doc-001\".into(),\n    source: IngestSource::RawText,\n    metadata: IngestMetadata {\n        tenant_id: Some(\"tenant-a\".to_string()),\n        doc_id: Some(\"my-doc\".to_string()),\n        ..Default::default()\n    },\n    payload: Some(IngestPayload::Text(\"Rust memory safety features\".into())),\n};\n\n// 4. 
Process through pipeline (ingest -\u003e canonical -\u003e perceptual -\u003e semantic)\nlet (doc, fingerprint, embedding) = process_pipeline(\n    record,\n    PipelineStageConfig::Perceptual,\n    \u0026ingest_cfg,\n    \u0026canonical_cfg,\n    Some(\u0026perceptual_cfg),\n    Some(\u0026semantic_cfg),\n)?;\n\n// 5. Store in index\nlet record = IndexRecord {\n    doc_id: doc.doc_id.clone(),\n    tenant_id: \"tenant-a\".to_string(),\n    canonical_hash: doc.canonical_hash.clone(),\n    perceptual_fingerprint: Some(fingerprint),\n    semantic_embedding: Some(embedding),\n    ..Default::default()\n};\nindex.upsert(record)?;\n\n// 6. Search with matcher\nlet matcher = Matcher::new(\n    index,\n    ingest_cfg,\n    canonical_cfg,\n    perceptual_cfg,\n    semantic_cfg,\n);\n\nlet req = MatchRequest {\n    tenant_id: \"tenant-a\".to_string(),\n    query_text: \"Rust safety\".to_string(),\n    config: MatchConfig::default(),\n    ..Default::default()\n};\n\nlet hits = matcher.match_document(\u0026req)?;\nprintln!(\"Found {} matches\", hits.len());\n```\n\n## Architecture\n\n| Stage | Responsibility | Key Types |\n|:------|:---------------|:----------|\n| **ingest** | Validation, metadata normalization | `RawIngestRecord`, `CanonicalIngestRecord` |\n| **canonical** | Unicode NFKC normalization, SHA-256 hashing | `CanonicalizedDocument` |\n| **perceptual** | Rolling-hash shingles, winnowing, MinHash LSH | `PerceptualFingerprint` |\n| **semantic** | Dense embeddings via ONNX | `SemanticEmbedding` |\n| **index** | Storage with HNSW ANN search | `UfpIndex`, `QueryResult` |\n| **match** | Query-time matching | `Matcher`, `MatchResult` |\n\n![UCFP Architecture Diagram](ucfp.png)\n\n### System Overview\n\nHow a request flows through the system, from the HTTP client down to storage and back:\n\n```mermaid\nflowchart LR\n    classDef client fill:#fef3c7,stroke:#d97706,stroke-width:2px,color:#78350f\n    classDef edge fill:#dbeafe,stroke:#2563eb,stroke-width:2px,color:#1e3a8a\n  
  classDef pipe fill:#ede9fe,stroke:#7c3aed,stroke-width:2px,color:#4c1d95\n    classDef store fill:#dcfce7,stroke:#16a34a,stroke-width:2px,color:#14532d\n\n    Client([Client / Web UI]):::client\n\n    subgraph Edge[\"ucfp-server (axum)\"]\n        direction TB\n        MW[/\"middleware:\n        auth · request-id · CORS · logging\"/]:::edge\n        Routes[/\"REST routes:\n        /process · /index · /match · /compare\"/]:::edge\n        MW --\u003e Routes\n    end\n\n    subgraph Pipe[\"Pipeline (ucfp umbrella)\"]\n        direction TB\n        Ingest[[ingest]]:::pipe\n        Canon[[canonical]]:::pipe\n        Perc[[perceptual]]:::pipe\n        Sem[[semantic]]:::pipe\n        Ingest --\u003e Canon --\u003e Perc\n        Canon --\u003e Sem\n    end\n\n    subgraph Store[\"State\"]\n        direction TB\n        Idx[(\"index\u003cbr/\u003eredb · HNSW · DashMap\")]:::store\n        Match[[matcher]]:::store\n    end\n\n    Client ==\u003e|HTTP| MW\n    Routes --\u003e Pipe\n    Perc --\u003e|MinHash bands| Idx\n    Sem  --\u003e|i8 quantized vec| Idx\n    Routes --\u003e Match\n    Match \u003c--\u003e Idx\n    Routes ==\u003e|JSON hits| Client\n```\n\n### Pipeline Data Flow\n\nEach stage produces a strongly-typed artifact that the next stage consumes. 
Perceptual and semantic branches are independent — either or both can be enabled per request:\n\n```mermaid\nflowchart TD\n    classDef input  fill:#fef3c7,stroke:#d97706,color:#78350f\n    classDef stage  fill:#ede9fe,stroke:#7c3aed,color:#4c1d95\n    classDef artifact fill:#e0f2fe,stroke:#0284c7,color:#0c4a6e\n    classDef output fill:#dcfce7,stroke:#16a34a,color:#14532d\n\n    Raw[\"RawIngestRecord\u003cbr/\u003e\u003ci\u003eid · source · metadata · payload\u003c/i\u003e\"]:::input\n\n    Ingest[\"ingest::ingest()\u003cbr/\u003evalidate · normalize metadata\"]:::stage\n    CanonStep[\"canonical::canonicalize()\u003cbr/\u003eNFKC · lowercase · whitespace · SHA-256\"]:::stage\n    PercStep[\"perceptual::perceptualize_tokens()\u003cbr/\u003ek-shingles · winnowing · MinHash LSH\"]:::stage\n    SemStep[\"semantic::semanticize()\u003cbr/\u003eONNX / API embedding · L2 normalize\"]:::stage\n\n    CIR[\"CanonicalIngestRecord\"]:::artifact\n    Doc[\"CanonicalizedDocument\u003cbr/\u003e\u003ci\u003etokens · canonical_hash\u003c/i\u003e\"]:::artifact\n    FP[\"PerceptualFingerprint\u003cbr/\u003e\u003ci\u003eshingles · minhash[128]\u003c/i\u003e\"]:::artifact\n    Emb[\"SemanticEmbedding\u003cbr/\u003e\u003ci\u003eVec\u0026lt;f32\u0026gt; → quantize → Vec\u0026lt;i8\u0026gt;\u003c/i\u003e\"]:::artifact\n\n    IR[\"IndexRecord\u003cbr/\u003e\u003ci\u003ecanonical_hash · perceptual · embedding · metadata\u003c/i\u003e\"]:::output\n\n    Raw --\u003e Ingest --\u003e CIR --\u003e CanonStep --\u003e Doc\n    Doc --\u003e|tokens| PercStep --\u003e FP\n    Doc --\u003e|canonical text| SemStep --\u003e Emb\n    Doc  --\u003e IR\n    FP   --\u003e IR\n    Emb  --\u003e IR\n```\n\n### Match Strategies\n\n`MatchExpr` is a composable tree — leaves run against the index, inner nodes combine scores:\n\n```mermaid\nflowchart TD\n    classDef q fill:#fef3c7,stroke:#d97706,color:#78350f\n    classDef leaf fill:#e0f2fe,stroke:#0284c7,color:#0c4a6e\n    classDef combine 
fill:#ede9fe,stroke:#7c3aed,color:#4c1d95\n    classDef out fill:#dcfce7,stroke:#16a34a,color:#14532d\n\n    Q[\"MatchRequest\u003cbr/\u003e\u003ci\u003etenant · query_text · MatchExpr\u003c/i\u003e\"]:::q\n\n    Exact[\"MatchExpr::Exact\u003cbr/\u003equery.canonical_hash == doc.canonical_hash\"]:::leaf\n    Perc[\"MatchExpr::Perceptual { min_score }\u003cbr/\u003eJaccard over MinHash bands\"]:::leaf\n    Sem[\"MatchExpr::Semantic { min_score }\u003cbr/\u003ecosine over i8 embedding (HNSW)\"]:::leaf\n    Weight[\"MatchExpr::Weighted { alpha, min_overall }\u003cbr/\u003eα·sem + (1-α)·perc\"]:::combine\n    And[\"MatchExpr::And\u003cbr/\u003emin(left, right)\"]:::combine\n    Or[\"MatchExpr::Or\u003cbr/\u003emax(left, right)\"]:::combine\n\n    Rank[\"rank · tenant filter · truncate(max_results)\"]:::combine\n    Hits([\"Vec\u0026lt;MatchHit\u0026gt;\u003cbr/\u003ehash · score · per-mode scores · metadata\"]):::out\n\n    Q --\u003e Exact\n    Q --\u003e Perc\n    Q --\u003e Sem\n    Q --\u003e Weight\n    Q --\u003e And\n    Q --\u003e Or\n    Exact  --\u003e Rank\n    Perc   --\u003e Rank\n    Sem    --\u003e Rank\n    Weight --\u003e Rank\n    And    --\u003e Rank\n    Or     --\u003e Rank\n    Rank --\u003e Hits\n```\n\n### Crate Layering\n\nThe workspace is strictly layered — no cycles. 
Lower crates know nothing of higher ones:\n\n```mermaid\nflowchart BT\n    classDef foundation fill:#e0f2fe,stroke:#0284c7,color:#0c4a6e\n    classDef feature fill:#ede9fe,stroke:#7c3aed,color:#4c1d95\n    classDef glue fill:#fce7f3,stroke:#db2777,color:#831843\n    classDef top fill:#dcfce7,stroke:#16a34a,color:#14532d\n\n    ingest[ingest]:::foundation\n    canonical[canonical]:::foundation\n\n    perceptual[perceptual]:::feature\n    semantic[semantic]:::feature\n\n    index[index]:::glue\n    matcher[matcher]:::glue\n\n    ucfp[ucfp\u003cbr/\u003e\u003ci\u003eumbrella\u003c/i\u003e]:::top\n    server[ucfp-server]:::top\n\n    canonical --\u003e perceptual\n    canonical --\u003e semantic\n    ingest    --\u003e ucfp\n    canonical --\u003e ucfp\n    perceptual --\u003e ucfp\n    semantic   --\u003e ucfp\n    ingest    --\u003e matcher\n    canonical --\u003e matcher\n    perceptual --\u003e matcher\n    semantic  --\u003e matcher\n    index     --\u003e matcher\n    ucfp      --\u003e server\n    matcher   --\u003e server\n    index     --\u003e server\n```\n\n### Request Lifecycle: `POST /api/v1/match`\n\nA traced view of a single match request — useful for understanding latency hotspots:\n\n```mermaid\nsequenceDiagram\n    autonumber\n    participant C as Client\n    participant MW as Middleware\u003cbr/\u003e(auth · request-id · rate limit)\n    participant R as Route\u003cbr/\u003ematching::match_documents\n    participant M as Matcher\n    participant P as Pipeline\u003cbr/\u003e(ingest → canonical → sem/perc)\n    participant I as UfpIndex\u003cbr/\u003e(HNSW + DashMap)\n\n    C-\u003e\u003eMW: POST /api/v1/match + X-API-Key\n    MW-\u003e\u003eMW: validate key · tag request-id\n    MW-\u003e\u003eR: forward\n    R-\u003e\u003eM: MatchRequest { tenant, query_text, MatchExpr }\n\n    rect rgba(237,233,254,0.4)\n        note over M,P: query → fingerprint\n        M-\u003e\u003eP: build RawIngestRecord(query_text)\n        P-\u003e\u003eP: ingest · 
canonicalize · (perceptual | semantic)\n        P--\u003e\u003eM: CanonicalizedDocument + FP / Embedding\n    end\n\n    rect rgba(224,242,254,0.4)\n        note over M,I: index lookup\n        M-\u003e\u003eI: query_perceptual(fp)\n        I--\u003e\u003eM: Vec\u0026lt;QueryResult\u0026gt;\n        M-\u003e\u003eI: query_semantic(quantized_vec)\n        I--\u003e\u003eM: Vec\u0026lt;QueryResult\u0026gt;\n    end\n\n    M-\u003e\u003eM: score · tenant filter · rank · truncate\n    M--\u003e\u003eR: Vec\u0026lt;MatchHit\u0026gt;\n    R--\u003e\u003eMW: JSON response\n    MW--\u003e\u003eC: 200 OK + hits\n```\n\n## Configuration\n\n```yaml\nversion: \"1.0\"\n\ningest:\n  default_tenant_id: \"acme-corp\"\n  max_payload_bytes: 10485760\n\ncanonical:\n  normalize_unicode: true\n  lowercase: true\n\nperceptual:\n  k: 9              # shingle size\n  w: 4              # winnow window\n  minhash_bands: 16\n\nsemantic:\n  tier: \"balanced\"\n  enable_chunking: true  # For documents \u003e 512 tokens\n\nindex:\n  backend: \"redb\"\n  ann:\n    enabled: true\n    min_vectors_for_ann: 1000\n```\n\nLoad in code:\n```rust\nuse ucfp::config::UcfpConfig;\nlet config = UcfpConfig::from_file(\"config.yaml\")?;\n```\n\n## Performance\n\n| Stage | Latency | Notes |\n|:------|:--------|:------|\n| `ingest` | ~45 μs | Validation + metadata |\n| `canonical` | ~180 μs | Unicode NFKC + SHA-256 |\n| `perceptual` | ~180 μs | Parallel MinHash LSH |\n| `semantic` | ~8.5 ms | ONNX embedding |\n| `index` | ~50 μs | Lock-free DashMap |\n| `match` | ~50-450 μs | ANN O(log n) at \u003e1K vectors |\n\n**Optimizations**: Lock-free concurrency, parallel MinHash, HNSW ANN search, HTTP/2 connection pooling, SIMD vector operations.\n\nDisable semantic stage for ~100 μs/doc when exact + perceptual matching is sufficient.\n\n## API\n\nREST API server included. 
Quick example:\n\n```bash\ncurl -X POST http://localhost:8080/api/v1/process \\\n  -H \"Content-Type: application/json\" \\\n  -H \"X-API-Key: your-api-key\" \\\n  -d '{\n    \"text\": \"Your document content...\",\n    \"enable_semantic\": true\n  }'\n```\n\n**API Limits:**\n- Maximum text size: **10 MB** per document\n- Maximum batch size: **1000 documents**\n\nSee [`crates/server/API.md`](crates/server/API.md) for full API reference.\n\n## Roadmap\n\n| Modality | Status | Canonicalizer | Fingerprint | Embedding |\n|:---------|:-------|:--------------|:------------|:----------|\n| **Text** | Ready | NFKC + tokenization | MinHash | BGE / E5 |\n| **Image** | Planned | DCT normalization | pHash | CLIP / SigLIP |\n| **Audio** | Planned | Mel-spectrogram | Winnowing | SpeechCLIP / Whisper |\n| **Video** | Planned | Keyframes | Scene hashes | VideoCLIP / XCLIP |\n| **Document** | Planned | OCR + layout | Layout graph | LayoutLMv3 |\n\n## Development\n\n```bash\n./run-ci-local.sh  # Format, lint, test, build\n```\n\nSee [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.\n\n## License\n\nApache-2.0\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbravo1goingdark%2Fucfp","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbravo1goingdark%2Fucfp","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbravo1goingdark%2Fucfp/lists"}