{"id":36938314,"url":"https://github.com/gifton/vectoraccelerate","last_synced_at":"2026-04-14T07:02:28.431Z","repository":{"id":326341453,"uuid":"1103581012","full_name":"gifton/VectorAccelerate","owner":"gifton","description":"Swift6 GPU-accelerated vector operations using Metal4 shaders for Apple Silicon performance optimization","archived":false,"fork":false,"pushed_at":"2026-04-11T03:15:48.000Z","size":2549,"stargazers_count":2,"open_issues_count":2,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-04-11T05:35:38.630Z","etag":null,"topics":["accelerator","apple","apple-silicon","metal","swift","vector-database"],"latest_commit_sha":null,"homepage":"","language":"Swift","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/gifton.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-11-25T04:04:48.000Z","updated_at":"2026-04-11T03:13:41.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/gifton/VectorAccelerate","commit_stats":null,"previous_names":["gifton/vectoraccelerate"],"tags_count":17,"template":false,"template_full_name":null,"purl":"pkg:github/gifton/VectorAccelerate","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gifton%2FVectorAccelerate","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gifton%2FVectorAccelerate/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gifto
n%2FVectorAccelerate/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gifton%2FVectorAccelerate/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/gifton","download_url":"https://codeload.github.com/gifton/VectorAccelerate/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gifton%2FVectorAccelerate/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31785681,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-14T02:24:21.117Z","status":"ssl_error","status_checked_at":"2026-04-14T02:24:20.627Z","response_time":153,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["accelerator","apple","apple-silicon","metal","swift","vector-database"],"created_at":"2026-01-13T10:14:39.838Z","updated_at":"2026-04-14T07:02:28.424Z","avatar_url":"https://github.com/gifton.png","language":"Swift","funding_links":[],"categories":[],"sub_categories":[],"readme":"# VectorAccelerate\n\n**GPU-Accelerated Vector Operations for VectorCore — Metal 4 Edition**\n\nVectorAccelerate provides high-performance GPU acceleration for vector operations, serving as the computational backbone for the VectorCore ecosystem. 
By leveraging Metal 4's compute shaders, unified command encoding, and Apple Silicon's unified memory architecture, VectorAccelerate delivers up to 100x speedups for large-scale vector operations.\n\n\u003e **⚠️ Version 0.4.2**: Requires **Metal 4** (macOS 26.0+, iOS 26.0+, visionOS 3.0+). For older OS support, use VectorAccelerate 0.2.x.\n\u003e \n\u003e **⚠️ This package is still experimental, with development and real-world testing in progress.** For production-grade vector operations, see the CPU-bound implementations in VectorCore and VectorIndex.\n\n## 🎯 Purpose\n\nVectorAccelerate exists to solve a critical performance bottleneck in vector-based machine learning applications in storage-, thermal-, and power-constrained environments. While VectorCore provides an elegant Swift interface for vector operations, VectorAccelerate ensures these operations run at maximum speed by:\n\n- **Metal 4 Acceleration**: Leveraging unified command encoding and tensor operations\n- **Optimized Kernels**: Hand-tuned Metal shaders for specific dimensions (512, 768, 1536)\n- **Memory Efficiency**: Advanced quantization and compression techniques\n- **ML Integration**: Experimental learned distance metrics with MLTensor support\n- **Seamless Integration**: Drop-in acceleration for VectorCore operations\n\n## 🏛️ Two-Layer API Architecture\n\nVectorAccelerate provides **two complementary API layers** designed for different use cases:\n\n```\n┌─────────────────────────────────────────────────────────────┐\n│  Layer 1: High-Level API (AcceleratedVectorIndex)           │\n│  ┌───────────────────────────────────────────────────────┐  │\n│  │  • Complete vector search solution                     │  │\n│  │  • insert(), search(), remove(), compact()             │  │\n│  │  • Automatic GPU/CPU routing                           │  │\n│  │  • Best for: Applications needing similarity search    │  │\n│  └───────────────────────────────────────────────────────┘  
│\n├─────────────────────────────────────────────────────────────┤\n│  Layer 2: Low-Level API (25+ GPU Kernel Primitives)         │\n│  ┌───────────────────────────────────────────────────────┐  │\n│  │  • Direct GPU kernel access                            │  │\n│  │  • Distance, Selection, Quantization, Matrix kernels   │  │\n│  │  • Pipeline composition with encode() methods          │  │\n│  │  • Best for: Custom ML pipelines, fine-grained control │  │\n│  └───────────────────────────────────────────────────────┘  │\n└─────────────────────────────────────────────────────────────┘\n```\n\n### Layer 1: High-Level API\n\nFor most applications, use `AcceleratedVectorIndex`:\n\n```swift\nimport VectorAccelerate\n\n// Create a flat index (exact search)\nlet index = try await AcceleratedVectorIndex(\n    configuration: .flat(dimension: 768, capacity: 10_000)\n)\n\n// Insert vectors and get handles\nlet handle = try await index.insert(embedding)\n\n// Search for nearest neighbors\nlet results = try await index.search(query: queryVector, k: 10)\nfor result in results {\n    print(\"Handle: \\(result.id), Distance: \\(result.distance)\")\n}\n```\n\n### Layer 2: Low-Level Kernel Primitives\n\nFor custom pipelines or maximum control, use kernels directly:\n\n```swift\nimport VectorAccelerate\n\nlet context = try await Metal4Context()\n\n// Use individual kernels\nlet l2Kernel = try await L2DistanceKernel(context: context)\nlet distances = try await l2Kernel.compute(\n    queries: queryVectors,\n    database: databaseVectors,\n    computeSqrt: true\n)\n\n// Compose kernels in a pipeline\nlet normKernel = try await L2NormalizationKernel(context: context)\nlet cosineKernel = try await CosineSimilarityKernel(context: context)\nlet topKKernel = try await TopKSelectionKernel(context: context)\n\ntry await context.executeAndWait { _, encoder in\n    normKernel.encode(into: encoder, ...)\n    encoder.memoryBarrier(scope: .buffers)\n    cosineKernel.encode(into: encoder, ...)\n    
encoder.memoryBarrier(scope: .buffers)\n    topKKernel.encode(into: encoder, ...)\n}\n```\n\n### Convenient Type Aliases\n\n```swift\n// Context\nlet context: GPUContext = try await Metal4Context()\n\n// Distance kernels\nlet l2 = try await L2Kernel(context: context)\nlet cosine = try await CosineKernel(context: context)\nlet dot = try await DotKernel(context: context)\n\n// Selection kernels\nlet topK = try await TopKKernel(context: context)\nlet fused = try await FusedTopKKernel(context: context)\n\n// Quantization kernels\nlet binary = try await BinaryQuantKernel(context: context)\nlet scalar = try await ScalarQuantKernel(context: context)\n```\n\n## 📦 Requirements\n\n### System Requirements\n- **macOS 26.0+** / **iOS 26.0+** / **tvOS 26.0+** / **visionOS 3.0+**\n- **Metal 4** capable device (Apple Silicon)\n- **Swift 6.0+**\n\n### Dependencies\n- **VectorCore 0.2.0**: The foundational vector mathematics package\n\n### Products\n- **VectorAccelerate**: Core GPU acceleration library\n\n## 🚀 Accelerated Operations\n\nAll kernels require a `Metal4Context` for initialization.\n\n### Core Distance Metrics\n- **L2 Distance** (`L2DistanceKernel`)\n  - Euclidean distance with optional sqrt\n  - Specialized kernels for dimensions 512, 768, 1536\n  - Batch processing for multiple query-database pairs\n\n- **Cosine Similarity** (`CosineSimilarityKernel`)\n  - Pre-normalized and with-normalization variants\n  - Output as similarity or distance (1 - similarity)\n  - Optimized for high-dimensional embeddings\n\n- **Dot Product** (`DotProductKernel`)\n  - SIMD-optimized implementation\n  - Automatic GEMV/GEMM path selection\n  - Batch matrix-vector products\n\n### Advanced Distance Metrics\n- **Hamming Distance** (`HammingDistanceKernel`) - Binary vector distances\n- **Minkowski Distance** (`MinkowskiDistanceKernel`) - Generalized Lp distances (includes L1/Manhattan and L∞/Chebyshev)\n- **Jaccard Distance** (`JaccardDistanceKernel`) - Set similarity for sparse data\n\n### 
Experimental ML Features\n- **Learned Distance** (`LearnedDistanceKernel`)\n  - Projection-based learned metrics\n  - MLTensor integration for neural distance computation\n  - Automatic fallback to standard L2 when unavailable\n\n- **Attention Similarity** (`AttentionSimilarityKernel`)\n  - Attention-weighted similarity computation\n  - Multi-head attention support\n\n- **Neural Quantization** (`NeuralQuantizationKernel`)\n  - Learned quantization with neural networks\n  - Adaptive codebook generation\n\n### Vector Operations\n- **L2 Normalization** (`L2NormalizationKernel`)\n  - In-place and out-of-place normalization\n  - Numerical stability for zero vectors\n  - Batch normalization support\n\n- **Element-wise Operations** (`ElementwiseKernel`)\n  - Addition, subtraction, multiplication, division\n  - Trigonometric functions (sin, cos, tan)\n  - Power and exponential operations\n  - Broadcasting support\n\n### Selection \u0026 Sorting\n- **Top-K Selection** (`TopKSelectionKernel`)\n  - General purpose top-k with configurable k\n  - Warp-level optimization for common k values\n  - Streaming support for massive datasets\n\n- **Fused L2 + Top-K** (`FusedL2TopKKernel`)\n  - Combined distance computation and selection\n  - Reduced memory bandwidth\n  - Optimal for nearest neighbor search\n\n- **Parallel Reduction** (`ParallelReductionKernel`)\n  - Sum, min, max reduction\n  - Mean and variance computation\n  - Custom reduction operations\n\n### Matrix Operations\n- **Matrix Multiply (GEMM)** (`MatrixMultiplyKernel`)\n  - Tiled implementation with shared memory (32×32×8 tiles)\n  - Support for transposed inputs\n  - Alpha/beta scaling (C = α·A·B + β·C)\n\n- **Matrix-Vector (GEMV)** (`MatrixVectorKernel`)\n  - SIMD group optimizations\n  - Row/column major support\n  - Batch vector operations\n\n- **Matrix Transpose** (`MatrixTransposeKernel`)\n  - Tiled transpose for coalesced access\n  - Optimized shared memory usage\n\n- **Batch Matrix Operations** 
(`BatchMatrixKernel`)\n  - Fused multiply-add with bias\n  - Strided tensor operations\n\n### Statistical Operations\n- **Statistics** (`StatisticsKernel`)\n  - Mean, variance, standard deviation\n  - Skewness and kurtosis\n  - Percentiles and quartiles\n  - Running statistics updates\n\n- **Histogram** (`HistogramKernel`)\n  - Uniform, adaptive, and logarithmic binning\n  - Multi-dimensional histograms\n  - Kernel density estimation\n\n### Quantization \u0026 Compression\n- **Scalar Quantization** (`ScalarQuantizationKernel`)\n  - INT8/INT4 quantization with scale and offset\n  - Symmetric and asymmetric modes\n  - Per-channel quantization\n\n- **Binary Quantization** (`BinaryQuantizationKernel`)\n  - 1-bit vector compression\n  - Hamming distance on packed bits\n  - 32x memory reduction\n\n- **Product Quantization** (`ProductQuantizationKernel`)\n  - Subspace decomposition\n  - Codebook-based compression\n  - Asymmetric distance computation\n\n## 🗂️ GPU-Accelerated Vector Index\n\nGPU-first vector index for high-performance similarity search.\n\n### Quick Start\n\n```swift\nimport VectorAccelerate\n\n// Create a GPU-accelerated flat index\nlet config = IndexConfiguration.flat(dimension: 768, capacity: 10_000)\nlet index = try await AcceleratedVectorIndex(configuration: config)\n\n// Insert vectors\nlet handle = try await index.insert(embedding, metadata: [\"type\": \"document\"])\n\n// Search (returns L2² distances - native GPU format)\nlet results = try await index.search(query: queryVector, k: 10)\n\n// Filtered search\nlet filtered = try await index.search(query: queryVector, k: 10) { handle, meta in\n    meta?[\"type\"] == \"document\"\n}\n```\n\n### Index Types\n\n| Type | Use Case | Performance |\n|------|----------|-------------|\n| **Flat** | Small-medium datasets (\u003c100K) | Exact results, ~0.3ms search |\n| **IVF** | Large datasets (100K+) | Approximate, faster at scale |\n\n### Handle Stability\n\n`VectorHandle` instances are **stable** — they 
remain valid across `compact()` operations:\n\n```swift\nlet handle = try await index.insert(embedding)\n\n// Delete some vectors\ntry await index.remove(otherHandle)\n\n// Compact to reclaim space\ntry await index.compact()\n\n// Original handle still works!\nlet vector = try await index.vector(for: handle)  // ✓ Valid\n```\n\nInternally, handles use an indirection table that maps stable IDs to storage slots, so compaction can relocate vectors without invalidating user-held handles.\n\n### Performance\n\n- **Insert**: ~21K vectors/sec (128D), ~3.7K vectors/sec (768D)\n- **Search**: 0.30ms (128D), 0.73ms (768D) on 5K vectors\n- **Sub-millisecond** latency for typical workloads\n\n## 🔧 Installation\n\n### Swift Package Manager\n\nAdd VectorAccelerate to your `Package.swift`:\n\n```swift\ndependencies: [\n    .package(url: \"https://github.com/gifton/VectorAccelerate.git\", from: \"0.4.2\"),\n    .package(url: \"https://github.com/gifton/VectorCore.git\", from: \"0.2.0\")\n],\ntargets: [\n    .target(\n        name: \"YourTarget\",\n        dependencies: [\n            \"VectorAccelerate\",\n            \"VectorCore\"\n        ]\n    )\n]\n```\n\n\u003e **Note**: Version 0.4.2+ requires Metal 4. 
For macOS 15 / iOS 18 support, use version 0.1.x.\n\n## 🎓 Getting Started\n\n### Basic Usage\n\n```swift\nimport VectorAccelerate\nimport VectorCore\n\n// Initialize Metal 4 context (async)\nlet context = try await Metal4Context()\nlet l2Kernel = try await L2DistanceKernel(context: context)\n\n// Prepare your vectors\nlet queries = [[Float]](repeating: [Float](repeating: 0.5, count: 768), count: 10)\nlet database = [[Float]](repeating: [Float](repeating: 0.3, count: 768), count: 1000)\n\n// Compute distances on GPU\nlet distances = try await l2Kernel.compute(\n    queries: queries,\n    database: database,\n    computeSqrt: true  // For Euclidean distance\n)\n\nprint(\"Computed \\(distances.count) x \\(distances[0].count) distances on GPU\")\n```\n\n### Using with VectorCore Types\n\n```swift\nimport VectorAccelerate\nimport VectorCore\n\n// Create VectorCore vectors\nlet vector1 = Vector\u003cD768\u003e([Float](repeating: 1.0, count: 768))\nlet vector2 = Vector\u003cD768\u003e([Float](repeating: 0.5, count: 768))\n\n// Use GPU-accelerated operations\nlet context = try await Metal4Context()\nlet cosineSim = try await CosineSimilarityKernel(context: context)\nlet similarity = try await cosineSim.compute(\n    queries: [vector1],\n    database: [vector2]\n)\n```\n\n### Advanced: Fused Operations\n\n```swift\n// Fused L2 distance + Top-K selection (single pass for K ≤ 8; two-pass fallback for larger K)\nlet context = try await Metal4Context()\nlet fusedKernel = try await FusedL2TopKKernel(context: context)\n\n// Prepare query and dataset vectors\nlet queries = [[Float]](repeating: [Float](repeating: 0.5, count: 768), count: 10)\nlet dataset = [[Float]](repeating: [Float](repeating: 0.3, count: 768), count: 1000)\n\nlet results = try await fusedKernel.findNearestNeighbors(\n    queries: queries,\n    dataset: dataset,\n    k: 10\n)\n\n// Result contains top-10 nearest neighbors for each query\nfor (queryIndex, neighbors) in results.enumerated() {\n    print(\"Query \\(queryIndex): \\(neighbors.results.count) neighbors 
found\")\n}\n```\n\n### Learned Distance (Experimental ML)\n\n```swift\n// Use learned projections for distance computation\nlet context = try await Metal4Context()\nlet config = AccelerationConfiguration(enableExperimentalML: true)\nlet service = try await LearnedDistanceService(context: context, configuration: config)\n\n// Load projection weights (e.g., from a trained model)\ntry await service.loadProjection(\n    from: weightsURL,\n    inputDim: 768,\n    outputDim: 128\n)\n\n// Compute distances with learned projection\nlet (distances, mode) = try await service.computeL2(\n    queries: queries,\n    database: database\n)\n\nprint(\"Computed using \\(mode) mode\")  // .learned or .standard (fallback)\n```\n\n### Matrix Operations\n\n```swift\n// GPU-accelerated matrix multiplication\nlet context = try await Metal4Context()\nlet matrixKernel = try await MatrixMultiplyKernel(context: context)\n\nlet a = Matrix.random(rows: 1024, columns: 512)\nlet b = Matrix.random(rows: 512, columns: 256)\n\nlet result = try await matrixKernel.multiply(a: a, b: b)\nprint(\"Result: \\(result.rows) x \\(result.columns)\")\n```\n\n### VectorCore Integration\n\nVectorAccelerate provides GPU-accelerated `DistanceProvider` implementations:\n\n```swift\nimport VectorAccelerate\nimport VectorCore\n\n// Create a kernel-backed distance provider\nlet context = try await Metal4Context()\nlet provider = try await L2KernelDistanceProvider(context: context)\n\n// Works with VectorCore's DynamicVector\nlet v1 = DynamicVector([1.0, 0.0, 0.0])\nlet v2 = DynamicVector([0.0, 1.0, 0.0])\n\nlet distance = try await provider.distance(from: v1, to: v2, metric: .euclidean)\n\n// Batch distance computation\nlet candidates = [v1, v2, DynamicVector([0.5, 0.5, 0.0])]\nlet distances = try await provider.batchDistance(from: v1, to: candidates, metric: .euclidean)\n```\n\n**Universal Distance Provider** - handles all metrics automatically:\n\n```swift\n// Dispatches to the optimal kernel for each metric\nlet 
provider = context.universalDistanceProvider()\n\nlet euclidean = try await provider.distance(from: v1, to: v2, metric: .euclidean)\nlet cosine = try await provider.distance(from: v1, to: v2, metric: .cosine)\nlet manhattan = try await provider.distance(from: v1, to: v2, metric: .manhattan)\n```\n\n**Available Distance Providers:**\n\n| Provider | Metric | Features |\n|----------|--------|----------|\n| `L2KernelDistanceProvider` | Euclidean | Dimension-optimized (384, 512, 768, 1536) |\n| `CosineKernelDistanceProvider` | Cosine | Auto-normalization |\n| `DotProductKernelDistanceProvider` | Dot Product | GEMV optimization |\n| `MinkowskiKernelDistanceProvider` | Manhattan, Chebyshev | Configurable p-norm |\n| `JaccardKernelDistanceProvider` | Jaccard | Set similarity (specialized API) |\n| `HammingKernelDistanceProvider` | Hamming | Binary vectors (specialized API) |\n| `UniversalKernelDistanceProvider` | All | Auto-dispatch |\n\n## 📋 Choosing the Right Kernel\n\n### Distance Computation\n\n| Use Case | Recommended Kernel | Notes |\n|----------|-------------------|-------|\n| Nearest neighbor search | `L2DistanceKernel` | Best for embeddings |\n| Semantic similarity | `CosineSimilarityKernel` | Direction-based |\n| Maximum inner product | `DotProductKernel` | For unnormalized vectors |\n| Sparse data / sets | `JaccardDistanceKernel` | Document fingerprints |\n| Binary vectors | `HammingDistanceKernel` | After binary quantization |\n| Custom Lp norm | `MinkowskiDistanceKernel` | Configurable p |\n\n### Selection\n\n| Use Case | Recommended Kernel | Notes |\n|----------|-------------------|-------|\n| Standard top-k | `TopKSelectionKernel` | General purpose |\n| Memory-constrained | `FusedL2TopKKernel` | Avoids full distance matrix, auto-fallback for K \u003e 8 |\n| Large datasets | `FusedL2TopKKernel` | Uses chunked GPU merge for memory efficiency |\n| SIMD-optimized | `WarpOptimizedSelectionKernel` | Small k values (k ≤ 32) |\n\n\u003e **Note**: 
`StreamingTopKKernel` is deprecated due to known correctness issues. Use `FusedL2TopKKernel` which automatically handles large datasets via chunked GPU merge.\n\n### Quantization\n\n| Use Case | Recommended Kernel | Compression |\n|----------|-------------------|-------------|\n| Fast approximate search | `BinaryQuantizationKernel` | 32x |\n| Quality-preserving | `ScalarQuantizationKernel` | 4-8x |\n| High compression | `ProductQuantizationKernel` | 32-64x |\n\n## 🏗️ Architecture\n\n### Kernel Organization\n\nVectorAccelerate is organized into specialized Metal 4 kernels:\n\n```\nVectorAccelerate/\n├── Core/                    # Metal 4 infrastructure\n│   ├── Metal4Context.swift      # Unified context management\n│   ├── Metal4ComputeEngine.swift # Command encoding\n│   ├── ResidencyManager.swift    # Memory residency\n│   ├── PipelineCache.swift       # Pipeline state caching\n│   └── TensorManager.swift       # Tensor operations\n├── Kernels/\n│   └── Metal4/              # All Metal 4 kernel implementations\n│       ├── L2DistanceKernel.swift\n│       ├── CosineSimilarityKernel.swift\n│       ├── MatrixMultiplyKernel.swift\n│       └── ... (20+ kernels)\n├── Metal/\n│   └── Shaders/             # Metal shader source files (.metal)\n└── Operations/              # High-level operation orchestration\n```\n\n### Metal 4 Features Used\n\n- **Unified Command Encoding**: Single encoder for compute + blit operations\n- **Residency Sets**: Explicit memory management for optimal GPU utilization\n- **Argument Tables**: Efficient parameter passing to shaders\n- **Pipeline Harvesting**: Background pipeline compilation\n\n### Performance Optimizations\n\n1. **Hierarchical SIMD Reductions**: Overhauled L2 and Cosine kernels using a 4-phase reduction model (`Local -\u003e Warp -\u003e Threadgroup -\u003e Global`), maximizing 128-bit memory bus saturation and eliminating global atomic stalls.\n2. 
**Tiled Shared Memory Algorithms**: KMeans assignment dynamically scales centroid tiles to fit within hardware limits (32KB), using register-cached compute loops to reduce global memory pressure by 32x.\n3. **2-Pass GPU Orchestration**: K-Means update logic uses a \"Cooperative Gather\" topology, collaboratively summing dimensions via SIMD-group operations for maximum throughput.\n4. **Enforced Asynchronous Execution**: Fully non-blocking execution model using `await commitAndWait()` and Swift 6 concurrency, ensuring zero OS thread stalls during GPU work.\n5. **Dynamic Buffer Pooling**: Ring-buffer strategy with Power-of-2 bucketing eliminates allocation overhead in hot loops, with `BufferToken` anchoring for safe asynchronous memory reclamation.\n6. **Eager Pipeline Pre-compilation**: Background pre-compilation of critical path kernels during initialization to eliminate cold-start latency.\n7. **Tiled GEMM Neural Encoder**: High-performance 2-pass neural encoder using a Full-D register loop and shared memory padding to eliminate bank conflicts and global atomic bottlenecks.\n8. 
**Vectorized Transposed Decoder**: Optimized dequantization path using dual-accumulator latency hiding and dimension-specific loop unrolling for a 2x throughput gain.\n\n## 📊 Performance\n\nTypical speedups over CPU implementations:\n\n| Operation | Vector Size | CPU Time | GPU Time | Speedup |\n|-----------|------------|----------|----------|---------|\n| L2 Distance | 1M × 10K | 12.3s | 0.15s | 82x |\n| Cosine Similarity | 100K × 100K | 8.7s | 0.09s | 97x |\n| Matrix Multiply | 2048 × 2048 | 1.8s | 0.03s | 60x |\n| Top-K Selection | 1M, k=100 | 0.9s | 0.02s | 45x |\n| INT8 Quantization | 10M vectors | 2.1s | 0.04s | 52x |\n\n## ⚠️ Known Limitations\n\n### Fused L2 Top-K Kernel\n\nThe `FusedL2TopKKernel` uses different strategies based on K:\n\n| K Range | Strategy | Memory |\n|---------|----------|--------|\n| K ≤ 8 | Fused single-pass | O(Q × K) - no distance matrix |\n| 8 \u003c K ≤ 32 | Two-pass with warp selection | O(Q × N) distance matrix |\n| K \u003e 32 | Two-pass with standard selection | O(Q × N) distance matrix |\n| Large N | Chunked GPU merge | Bounded by `maxDistanceMatrixBytes` |\n\nFor K \u003e 8, the kernel automatically falls back to a two-pass approach. For very large datasets, it uses chunked processing with GPU-side merge to stay within memory bounds.\n\n### IVF Index (Work in Progress)\n\nThe IVF index has known correctness issues being addressed:\n- Batch search may not correctly isolate per-query candidate lists\n- Training with many deletions may cause duplicate entries\n\nFor production use, prefer the **flat index** until these are resolved (see `QUALITY_IMPROVEMENT_ROADMAP.md` P0.1-P0.4).\n\n### Metal Shader Compilation\n\nShaders are validated with `-std=metal3.0` in CI for syntax checking, but the runtime requires **Metal 4** features (macOS 26.0+). 
This is intentional — Metal 3 syntax is a subset of Metal 4.\n\n## 🧪 Testing\n\nRun the test suite:\n\n```bash\nswift test\n```\n\nRun performance benchmarks:\n\n```bash\nswift run VectorAccelerateBenchmarks\n```\n\n## 🤝 Contributing\n\nContributions are welcome! Please see [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.\n\n### Areas for Contribution\n- Additional distance metrics\n- Optimizations for new Apple Silicon features (M4, M5)\n- Enhanced ML integration\n- Performance benchmarks and comparisons\n\n## 📄 License\n\nVectorAccelerate is available under the MIT license. See [LICENSE](LICENSE) for details.\n\n## 🙏 Acknowledgments\n\n- Apple's Metal team for the excellent GPU framework\n- The VectorCore team for the foundational vector library\n- Contributors to the scientific computing community\n\n## 📚 Related Projects\n\n- [VectorCore](https://github.com/gifton/VectorCore) - Core vector mathematics\n- [VectorIndex](https://github.com/gifton/VectorIndex) - CPU-based vector indexing algorithms\n- [VectorDatabase](https://github.com/gifton/VectorDatabase) - Complete vector database solution\n\n---\n\n**Requirements**: VectorAccelerate 0.4.2+ requires **Metal 4** (macOS 26.0+, iOS 26.0+, visionOS 3.0+) and Apple Silicon. For older OS versions, use VectorAccelerate 0.1.x.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgifton%2Fvectoraccelerate","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgifton%2Fvectoraccelerate","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgifton%2Fvectoraccelerate/lists"}