{"id":50759378,"url":"https://github.com/debanjan06/spatial-streamio","last_synced_at":"2026-06-11T08:30:49.921Z","repository":{"id":361475990,"uuid":"1254525916","full_name":"debanjan06/spatial-streamio","owner":"debanjan06","description":"An optimized, out-of-core asynchronous data streaming pipeline for high-throughput 3D point cloud training loops. Features low-level numpy.memmap zero-copy reads and multi-threaded ring prefetching to eliminate I/O bottlenecks, delivering a 33.33% throughput efficiency gain on PyTorch CUDA workloads.","archived":false,"fork":false,"pushed_at":"2026-05-30T20:19:14.000Z","size":12,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-05-30T21:13:40.099Z","etag":null,"topics":["asynchronous-programming","cuda","data-engineering","deep-learning-pipelines","io-optimization","memory-mapping","point-cloud","pytorch"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/debanjan06.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-05-30T17:17:17.000Z","updated_at":"2026-05-30T20:19:18.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/debanjan06/spatial-streamio","commit_stats":null,"previous_names":["debanjan06/spatial-streamio"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/debanjan06/spatial-streamio","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/debanjan06%2Fspatial-streamio","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/debanjan06%2Fspatial-streamio/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/debanjan06%2Fspatial-streamio/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/debanjan06%2Fspatial-streamio/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/debanjan06","download_url":"https://codeload.github.com/debanjan06/spatial-streamio/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/debanjan06%2Fspatial-streamio/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34190582,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-11T02:00:06.485Z","response_time":57,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["asynchronous-programming","cuda","data-engineering","deep-learning-pipelines","io-optimization","memory-mapping","point-cloud","pytorch"],"created_at":"2026-06-11T08:30:47.141Z","updated_at":"2026-06-11T08:30:49.915Z","avatar_url":"https://github.com/debanjan06.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Spatial-StreamIO\n\nAn optimized, out-of-core asynchronous data streaming pipeline designed for high-throughput training loops on massive 3D point cloud datasets.\n\nBy leveraging low-level memory-mapped file access (`numpy.memmap`) and multi-threaded ring-buffer prefetching, Spatial-StreamIO eliminates I/O bottlenecks during deep learning model execution, achieving a **33.33% pipeline throughput optimization** over standard sequential loaders when processing millions of production points on active CUDA GPU systems.\n\n## Features\n\n- **Zero-Copy Out-of-Core Processing**: Maps dense point cloud matrices directly to virtual memory space instead of loading complete gigabyte-scale datasets into system RAM all at once.\n- **Multi-Threaded Ring Prefetching**: Utilizes background worker threads to read and stage the next data batch in a dedicated queue while the GPU computes the current training iteration.\n- **Thread-Safe Queue Management**: Robust synchronization prevents data loss or batch skipping, allowing background workers to block naturally until queue slots are freed.\n- **Production Epoch Signaling**: Implements deterministic `None` sentinel token handshakes to ensure clean epoch boundaries across continuous evaluation runs.\n- **Flexible Schema Parsing**: Streams complex raw spatial formats including X, Y, Z coordinates alongside intensity, semantic class, and instance class fields.\n\n## System Architecture\n\nThe pipeline decouples disk-read operations from the GPU execution timeline, hiding file access latency behind active compute windows:\n\n```\n[ Disk Binary File ]\n        |\n        v\n[ np.memmap View ] -----\u003e [ Background Prefetch Thread ]\n                                       |\n                            [ Thread-Safe Queue ]\n                                       |\n                                       v\n[ CUDA GPU ] \u003c---- [ PyTorch Tensor ] \u003c---- [ Main Training Loop ]\n```\n\n## Performance Benchmark\n\nTested on a production-grade workload processing **36,831,590 dense spatial records** paired with an active PyTorch CUDA tensor computation backbone:\n\n| Loader | Duration |\n|---|---|\n| Standard Sequential Baseline | 1.5356s |\n| Spatial-StreamIO Pipeline | 1.0238s |\n| **Efficiency Gain** | **33.33% improvement** |\n\n\u003e Benchmarked on real LiDAR point cloud tiles with 6 features per point (X, Y, Z, intensity, sem_class, ins_class) processed through a PyTorch linear backbone on an active CUDA device.\n\n## Repository Structure\n\n```text\nspatial-streamio/\n│\n├── spatial_streamio/\n│   ├── __init__.py\n│   ├── memory.py        # Low-level virtual memory mapping engine\n│   └── pipeline.py      # Asynchronous background queue orchestrator\n│\n├── data/                # Storage directory for compiled production binaries (.bin)\n├── tests/               # PyTest integration test suite\n└── benchmark.py         # Comparative evaluation suite running PyTorch CUDA layers\n```\n\n## Getting Started\n\n### Prerequisites\n\n```bash\npip install numpy torch plyfile pytest\n```\n\n### Running the Benchmark\n\n1. Place your `.ply` point cloud files inside the `data/` folder.\n2. Run the benchmark script to measure efficiency gains on your hardware:\n\n```bash\npython benchmark.py\n```\n\n## Core Implementation\n\n### Memory Mapping (`spatial_streamio/memory.py`)\n\n```python\nself.mmap_array = np.memmap(\n    self.file_path,\n    dtype=self.dtype,\n    mode='r',\n    shape=(self.num_points, self.num_features)\n)\n```\n\n### Prefetch Queue (`spatial_streamio/pipeline.py`)\n\n```python\n# Blocking insertion ensures zero-loss data synchronization\nself.queue.put(batch_buffer, block=True, timeout=self.timeout)\n```\n\n## License\n\nThis project is open-source and available under the MIT License.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdebanjan06%2Fspatial-streamio","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdebanjan06%2Fspatial-streamio","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdebanjan06%2Fspatial-streamio/lists"}