{"id":50560035,"url":"https://github.com/genpat-it/ohe-rs","last_synced_at":"2026-06-04T11:30:39.741Z","repository":{"id":351293002,"uuid":"1210375477","full_name":"genpat-it/ohe-rs","owner":"genpat-it","description":"Ultra-fast one-hot encoding for bioinformatics and ML, powered by Rust + CUDA. Built for cgMLST allele profiles and large-scale categorical data.","archived":false,"fork":false,"pushed_at":"2026-04-14T13:13:14.000Z","size":78,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"master","last_synced_at":"2026-04-14T13:13:15.883Z","etag":null,"topics":["bioinformatics","cuda","machine-learning","one-hot-encoding","performance","pyo3","python","rust"],"latest_commit_sha":null,"homepage":null,"language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/genpat-it.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-04-14T11:00:43.000Z","updated_at":"2026-04-14T13:13:13.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/genpat-it/ohe-rs","commit_stats":null,"previous_names":["genpat-it/ohe-rs"],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/genpat-it/ohe-rs","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/genpat-it%2Fohe-rs","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/genpat-it%2Fohe-rs/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitH
ub/repositories/genpat-it%2Fohe-rs/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/genpat-it%2Fohe-rs/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/genpat-it","download_url":"https://codeload.github.com/genpat-it/ohe-rs/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/genpat-it%2Fohe-rs/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33903134,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-04T02:00:06.755Z","response_time":64,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bioinformatics","cuda","machine-learning","one-hot-encoding","performance","pyo3","python","rust"],"created_at":"2026-06-04T11:30:39.307Z","updated_at":"2026-06-04T11:30:39.732Z","avatar_url":"https://github.com/genpat-it.png","language":"Rust","funding_links":[],"categories":[],"sub_categories":[],"readme":"# ohe-rs\n\n[![License: 
MIT](https://img.shields.io/badge/License-MIT-blue.svg)](LICENSE)\n[![Rust](https://img.shields.io/badge/Rust-1.70%2B-orange.svg)](https://www.rust-lang.org/)\n[![Python](https://img.shields.io/badge/Python-3.9%2B-blue.svg)](https://www.python.org/)\n[![CUDA](https://img.shields.io/badge/CUDA-Optional-green.svg)](https://developer.nvidia.com/cuda-toolkit)\n[![Bioconda](https://img.shields.io/conda/vn/bioconda/ohe-rs.svg)](https://bioconda.github.io/recipes/ohe-rs/README.html)\n[![Docker](https://img.shields.io/badge/Docker-ghcr.io-blue.svg)](https://github.com/genpat-it/ohe-rs/pkgs/container/ohe-rs)\n[![GitHub Release](https://img.shields.io/github/v/release/genpat-it/ohe-rs)](https://github.com/genpat-it/ohe-rs/releases)\n\nUltra-fast one-hot encoding powered by Rust + CUDA, with Python bindings.\n\n## Why ohe-rs?\n\nOne-hot encoding is a fundamental operation in machine learning pipelines, yet existing implementations in Python (scikit-learn, pandas, numpy) are surprisingly slow on large datasets. 
They suffer from Python overhead, single-threaded execution, and suboptimal memory access patterns.\n\n**ohe-rs** solves this by implementing one-hot encoding in Rust with:\n\n- **Parallel category discovery** using rayon + FxHashMap (lock-free, per-thread local maps with global merge)\n- **Zero-copy Python integration** via PyO3 + numpy array protocol\n- **Sparse CSR output** that uses ~13 bytes/row regardless of cardinality (vs N*K for dense)\n- **Optional CUDA acceleration** for GPU-resident data pipelines\n- **Memory-safe operation** with upfront estimation and chunked processing for large datasets\n\n## Benchmark Results\n\n**Machine:** 2x Intel Xeon Gold 6542Y (80 cores), 504 GB RAM, NVIDIA L4 (24 GB), Linux.\n**Protocol:** 10M rows, warm-up excluded, GC disabled, 7 repeats (median), uint8 output, 80 rayon threads.\n\n### End-to-End (category discovery + encoding)\n\n| Cardinality (K) | ohe-rs CPU | scikit-learn | Speedup |\n|---|---|---|---|\n| K = 10 | **26 ms** (387 M rows/s) | 381 ms | **15x** |\n| K = 1,000 | **21 ms** (468 M rows/s) | 740 ms | **35x** |\n| K = 100,000 | **56 ms** (179 M rows/s) | 1,310 ms | **23x** |\n\n### Transform-Only (K pre-known, no discovery) — full cartesian product\n\nEvery combination of H2D (host-to-device) and D2H (device-to-host) transfer benchmarked for both ohe-rs and PyTorch.\n\n**CPU:**\n\n| Method | K=10 | K=1,000 | K=100,000 |\n|---|---|---|---|\n| **ohe-rs CPU** | **18 ms** | **18 ms** | **21 ms** |\n| PyTorch sparse COO CPU | 29 ms | 29 ms | 28 ms |\n| sklearn (prefitted) | 400 ms | 698 ms | 1,337 ms |\n\n**GPU with H2D + D2H (data on host, result on host):**\n\n| Method | K=10 | K=1,000 | K=100,000 |\n|---|---|---|---|\n| **ohe-rs GPU** | **48 ms** | **44 ms** | 75 ms |\n| PyTorch GPU | 75 ms | 73 ms | **73 ms** |\n\n**GPU pre-loaded input, D2H output (kernel + D2H):**\n\n| Method | K=10 | K=1,000 | K=100,000 |\n|---|---|---|---|\n| **ohe-rs GPU** | **25 ms** | **25 ms** | **25 ms** |\n| PyTorch GPU | 66 ms | 65 ms 
| 64 ms |\n\n**GPU all on device — kernel only (no transfer):**\n\n| Method | K=10 | K=1,000 | K=100,000 |\n|---|---|---|---|\n| **ohe-rs GPU** | **1.3 ms** | **1.4 ms** | **1.4 ms** |\n| PyTorch GPU | 1.5 ms | 1.5 ms | 1.5 ms |\n\n\u003e **ohe-rs wins in nearly every scenario.** At K=100K with full H2D+D2H, PyTorch edges ahead (73 ms vs 75 ms) due to lower transfer overhead for COO metadata vs CSR arrays.\n\n\u003e **PyTorch `F.one_hot` limitation:** allocates a dense **int64** tensor (8 bytes/element) before casting. At K=1,000 with 10M rows this requires **80 GB of RAM**. ohe-rs sparse uses ~13 bytes/row regardless of K.\n\n### Thread Scaling\n\nOne-hot encoding is **memory-bandwidth bound**, not compute-bound. More threads help only up to the point where RAM bandwidth saturates. On our 80-core machine, the sweet spot is **8-16 threads**:\n\n| Threads | E2E K=10 | E2E K=100K | Transform K=10 |\n|---|---|---|---|\n| 1 | 58 ms | 273 ms | 24 ms |\n| 2 | 38 ms | 148 ms | 20 ms |\n| 4 | 30 ms | 101 ms | 22 ms |\n| 8 | **20 ms** | 70 ms | **16 ms** |\n| 16 | 20 ms | 62 ms | 16 ms |\n| 32 | 20 ms | **55 ms** | 20 ms |\n| 80 | 28 ms | 64 ms | 29 ms |\n\nBeyond 16 threads, performance mostly plateaus or **degrades** due to cache contention; the one exception in this table is E2E at K=100K, which still improves to 55 ms at 32 threads before regressing at 80. On typical workstations (4-8 cores), all cores are useful. 
Use `set_threads()` to tune:\n\n```python\nfrom ohe_rs import set_threads\nset_threads(8)  # recommended for machines with \u003e16 cores\n```\n\n## Installation\n\n### From source (recommended)\n\n```bash\n# Clone\ngit clone https://github.com/genpat-it/ohe-rs.git\ncd ohe-rs\n\n# CPU-only build\npip install maturin\nmaturin develop --release\n\n# With CUDA support (requires CUDA toolkit)\nCUDA_ROOT=/usr/local/cuda maturin develop --release --features cuda\n```\n\n### Docker (ghcr.io)\n\n```bash\n# CPU-only\ndocker pull ghcr.io/genpat-it/ohe-rs:latest\ndocker run --rm ghcr.io/genpat-it/ohe-rs -c \"from ohe_rs import encode_sparse; print('OK')\"\n\n# With CUDA support\ndocker pull ghcr.io/genpat-it/ohe-rs:latest-cuda\ndocker run --rm --gpus all ghcr.io/genpat-it/ohe-rs:latest-cuda -c \"from ohe_rs import gpu_available; print('GPU:', gpu_available())\"\n```\n\nImages are automatically built and pushed on each release.\n\n### Bioconda\n\n```bash\nconda install -c bioconda ohe-rs\n```\n\nAvailable for Python 3.10, 3.11, 3.12, 3.13 on linux-64 and linux-aarch64.\n\n### Build requirements\n\n- Rust 1.70+\n- Python 3.9+\n- numpy \u003e= 1.20\n- scipy \u003e= 1.7\n- CUDA toolkit (optional, for GPU support)\n\n## Usage\n\n### Sparse encoding (recommended)\n\n```python\nimport numpy as np\nfrom scipy.sparse import csr_matrix\nfrom ohe_rs import encode_sparse\n\ndata = np.array([0, 1, 2, 0, 1, 2, 3], dtype=np.int64)\nvalues, indices, indptr, n_categories = encode_sparse(data)\n\n# Build scipy sparse matrix\nmatrix = csr_matrix((values, indices, indptr), shape=(len(data), n_categories))\nprint(matrix.toarray())\n# [[1 0 0 0]\n#  [0 1 0 0]\n#  [0 0 1 0]\n#  [1 0 0 0]\n#  [0 1 0 0]\n#  [0 0 1 0]\n#  [0 0 0 1]]\n```\n\n### Dense encoding\n\n```python\nimport numpy as np\nfrom ohe_rs import encode_dense\n\ndata = np.array([0, 1, 2, 0], dtype=np.int64)\nmatrix = encode_dense(data)  # np.ndarray, shape (4, 3), dtype uint8\n```\n\n### String input\n\n```python\nfrom ohe_rs import 
encode_strings_sparse\n\nstrings = [\"cat\", \"dog\", \"cat\", \"bird\", \"dog\"]\nvalues, indices, indptr, categories, n_cats = encode_strings_sparse(strings)\nprint(categories)  # ['cat', 'dog', 'bird']\n```\n\n### Multi-column encoding (cgMLST / allele profiles)\n\nFor datasets with many categorical columns (e.g. cgMLST allele profiles), `encode_multi_sparse` encodes all columns in a single Rust call, avoiding Python loop overhead.\n\n```python\nimport numpy as np\nfrom scipy.sparse import csr_matrix\nfrom ohe_rs import encode_multi_sparse\n\n# cgMLST-like matrix: 10K samples x 8K loci, each cell is an allele ID\nprofiles = np.random.randint(0, 300, size=(10_000, 8_000), dtype=np.int64)\n\n# Single call — encodes all columns in parallel\nvalues, indices, indptr, total_cols, per_col_sizes = encode_multi_sparse(profiles)\n\n# Build scipy sparse matrix (rows=samples, cols=concatenated one-hot of all loci)\nmatrix = csr_matrix((values, indices, indptr), shape=(10_000, total_cols))\n# matrix.shape = (10000, ~2.4M)  — each row has exactly 8000 non-zeros\n```\n\n**Performance (10K samples x 8K loci, ~50-500 alleles per locus):**\n\n| Method | Time | Speedup |\n|---|---|---|\n| **ohe-rs encode_multi_sparse** | **724 ms** | **12x** |\n| ohe-rs per-column Python loop | 2,491 ms | 3.5x |\n| sklearn per-column | 8,618 ms | baseline |\n\n**Memory usage:**\n\n| | Size |\n|---|---|\n| Input matrix (int64) | 640 MB |\n| Sparse output (CSR) | **400 MB** |\n| Dense equivalent (uint8) | 21.8 GB |\n\nSparse uses **1.8%** of the memory that dense would require. The output matrix has shape (10,000 x 2,178,687) with 80M non-zeros — each row has exactly 8,000 ones (one per locus).\n\n### Cached encoder (fit once, transform many)\n\nFor repeated encoding against the same schema (e.g. 
new samples arriving daily):\n\n```python\nfrom ohe_rs import MultiEncoder\n\n# Fit once on reference dataset (builds category maps)\nencoder = MultiEncoder.fit(reference_profiles)  # 251 ms\n\n# Transform new samples instantly (skip discovery)\nresult = encoder.transform(new_profiles)  # 27 ms for 500 samples\n\n# Or fit + transform in one call\nencoder, values, indices, indptr, total_k, col_sizes = MultiEncoder.fit_transform(profiles)\n\n# Inspect\nprint(encoder.n_loci)                # 8000\nprint(encoder.total_columns)         # 2,178,687\nprint(encoder.categories_per_column) # [124, 89, 201, ...]\n```\n\n| Operation | Time |\n|---|---|\n| `fit` (10K reference, one-time) | 251 ms |\n| `transform` (500 new samples) | **27 ms** |\n| `transform` (10K samples) | 465 ms |\n| `encode_multi_sparse` (no cache) | 665 ms |\n\n### Memory estimation\n\n```python\nimport numpy as np\nfrom ohe_rs import estimate_memory\n\ndata = np.random.randint(0, 100_000, size=10_000_000, dtype=np.int64)\ndense_bytes, sparse_bytes = estimate_memory(data)\nprint(f\"Dense: {dense_bytes / 1e9:.1f} GB\")   # Dense: 1000.0 GB\nprint(f\"Sparse: {sparse_bytes / 1e6:.1f} MB\")  # Sparse: 130.0 MB\n```\n\n### Memory-safe dense encoding\n\n```python\nfrom ohe_rs import encode_dense\n\n# Automatically processes in chunks if needed\nmatrix = encode_dense(data, max_memory_mb=512)\n```\n\n### GPU acceleration\n\n```python\nfrom ohe_rs import gpu_available\n\nif gpu_available():\n    from ohe_rs import gpu_encode_sparse, gpu_encode_dense\n\n    values, indices, indptr, n_cats = gpu_encode_sparse(data)\n    dense_matrix = gpu_encode_dense(data)  # for small K\n```\n\n### Thread control\n\n```python\nfrom ohe_rs import set_threads\nset_threads(4)  # Limit to 4 threads\n```\n\n## Architecture\n\n```\nInput (Python numpy array)\n         |\n         v\n+----------------------------+\n|  Rust Core (PyO3 bindings) |\n|                            |\n|  1. 
Category Discovery     |\n|     rayon parallel chunks  |\n|     FxHashMap per-thread   |\n|     + sequential merge     |\n|                            |\n|  2. Encoding               |\n|     CPU: parallel write    |\n|     GPU: CUDA kernel       |\n|                            |\n|  3. Output                 |\n|     Sparse CSR (zero-copy) |\n|     Dense ndarray          |\n+----------------------------+\n         |\n         v\nscipy.sparse.csr_matrix / np.ndarray\n```\n\n### Why CPU beats GPU here\n\nOne-hot encoding is **memory-bound**, not compute-bound. Each element requires:\n- 1 hash lookup (category mapping)\n- 1 memory write (set the bit)\n\nThe GPU kernel itself runs in about 1.4 ms for 10M rows (see the kernel-only benchmark above), but host-device transfers dominate the total time: the host-to-device copy alone moves N int64 values (~80 MB for 10M rows). GPU wins when:\n- Data is **already on the GPU** (e.g., in a cuML/PyTorch pipeline)\n- You combine OHE with other GPU operations, amortizing the transfer cost\n\n## Development\n\n```bash\n# Build\ncargo build --release\n\n# Tests\ncargo test\n\n# Build with CUDA\nCUDA_ROOT=/usr/local/cuda cargo build --release --features cuda\n\n# Python development install\nmaturin develop --release\n\n# Run benchmarks\npython benchmark.py\n```\n\n## License\n\nMIT\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgenpat-it%2Fohe-rs","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgenpat-it%2Fohe-rs","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgenpat-it%2Fohe-rs/lists"}