{"id":31540985,"url":"https://github.com/scdataset/scdataset","last_synced_at":"2025-10-08T17:10:41.934Z","repository":{"id":294466913,"uuid":"986992951","full_name":"scDataset/scDataset","owner":"scDataset","description":"scDataset: Scalable Data Loading for Deep Learning on Large-Scale Single-Cell Omics","archived":false,"fork":false,"pushed_at":"2025-09-04T14:37:55.000Z","size":2839,"stargazers_count":31,"open_issues_count":1,"forks_count":1,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-10-04T10:32:15.774Z","etag":null,"topics":["big-data","bioinformatics","deep-learning","machine-learning","omics","pytorch","rna-seq","single-cell"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/scDataset.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-05-20T12:20:07.000Z","updated_at":"2025-09-30T02:57:03.000Z","dependencies_parsed_at":"2025-05-20T15:37:00.059Z","dependency_job_id":"a3cfc1f7-b4e5-4103-9733-108cf531a56c","html_url":"https://github.com/scDataset/scDataset","commit_stats":null,"previous_names":["kidara/scdataset","scdataset/scdataset"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/scDataset/scDataset","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/scDataset%2FscDataset","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/scDataset%2FscDataset/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/scDataset%2FscDataset/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/scDataset%2FscDataset/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/scDataset","download_url":"https://codeload.github.com/scDataset/scDataset/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/scDataset%2FscDataset/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":278981518,"owners_count":26079640,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-08T02:00:06.501Z","response_time":56,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["big-data","bioinformatics","deep-learning","machine-learning","omics","pytorch","rna-seq","single-cell"],"created_at":"2025-10-04T10:24:34.427Z","updated_at":"2025-10-08T17:10:41.929Z","avatar_url":"https://github.com/scDataset.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# scDataset\n\n[![PyPI version](https://badge.fury.io/py/scDataset.svg)](https://pypi.org/project/scDataset/)\n[![License: MIT](https://img.shields.io/badge/License-MIT-green.svg)](https://opensource.org/licenses/MIT)\n[![arXiv](https://img.shields.io/badge/arXiv-2506.01883-b31b1b.svg)](https://arxiv.org/abs/2506.01883)\n\nScalable Data Loading for Deep Learning on Large-Scale Single-Cell Omics\n\n---\n\n![scDataset architecture](https://github.com/Kidara/scDataset/raw/main/figures/scdataset.png)\n\n**scDataset** is a flexible and efficient PyTorch `IterableDataset` for large-scale single-cell omics datasets. It supports a variety of data formats (e.g., AnnData, HuggingFace Datasets, NumPy arrays) and is designed for high-throughput deep learning workflows. While optimized for single-cell data, it is general-purpose and can be used with any dataset.\n\n## Features\n\n- **Flexible Data Source Support**: Integrates seamlessly with AnnData, HuggingFace Datasets, NumPy arrays, PyTorch Datasets, and more.\n- **Scalable**: Handles datasets with billions of samples without loading everything into memory.\n- **Efficient Data Loading**: Block sampling and batched fetching optimize random access for large datasets.\n- **Dynamic Splitting**: Split datasets into train/validation/test dynamically, without duplicating data or rewriting files.\n- **Custom Hooks**: Apply transformations at fetch or batch time via user-defined callbacks.\n\n## Installation\n\nInstall the latest release from PyPI:\n\n```bash\npip install scDataset\n```\n\nOr install the latest development version from GitHub:\n\n```bash\npip install git+https://github.com/Kidara/scDataset.git\n```\n\n## Usage\n\n### Basic Usage with Sampling Strategies\n\nscDataset v0.2.0 uses a strategy-based approach for flexible data sampling:\n\n```python\nfrom scdataset import scDataset, Streaming\nfrom torch.utils.data import DataLoader\n\n# Create dataset with streaming strategy\ndata = my_data_collection  # Any indexable object (numpy array, AnnData, etc.)\nstrategy = Streaming()\ndataset = scDataset(data, strategy, batch_size=64)\nloader = DataLoader(dataset, batch_size=None)  # scDataset handles batching internally\n```\n\n\u003e **Note:** Set `batch_size=None` in the DataLoader to delegate batching to `scDataset`.\n\n### Sampling Strategies\n\n#### Sequential Sampling (Streaming)\n```python\nfrom scdataset import Streaming\n\n# Simple sequential access\nstrategy = Streaming()\ndataset = scDataset(data, strategy, batch_size=64)\n\n# Sequential with buffer-level shuffling (like Ray Dataset or WebDataset). The buffer size is equal to batch_size * fetch_factor (defined in the scDataset init)\nstrategy = Streaming(shuffle=True)\ndataset = scDataset(data, strategy, batch_size=64, fetch_factor=8)\n\n# Use only a subset of indices\ntrain_indices = [0, 2, 4, 6, 8, ...]  # Your training indices\nstrategy = Streaming(indices=train_indices)\ndataset = scDataset(data, strategy, batch_size=64)\n```\n\n#### Block Shuffling for Locality\n```python\nfrom scdataset import BlockShuffling\n\n# Shuffle blocks while maintaining some data locality\nstrategy = BlockShuffling(block_size=8)\ndataset = scDataset(data, strategy, batch_size=64)\n\n# With subset of indices\nstrategy = BlockShuffling(block_size=8, indices=train_indices)\ndataset = scDataset(data, strategy, batch_size=64)\n```\n\n#### Weighted Sampling\n```python\nfrom scdataset import BlockWeightedSampling\n\n# Uniform weighted sampling\nstrategy = BlockWeightedSampling(total_size=10000, block_size=16)\ndataset = scDataset(data, strategy, batch_size=64)\n\n# Custom weights (e.g., for imbalanced data)\nsample_weights = compute_weights(data)  # Your weight computation\nstrategy = BlockWeightedSampling(\n    weights=sample_weights, \n    total_size=5000,\n    block_size=16\n)\ndataset = scDataset(data, strategy, batch_size=64)\n```\n\n#### Automatic Class Balancing\n```python\nfrom scdataset import ClassBalancedSampling\n\n# Automatically balance classes from labels\ncell_types = ['T_cell', 'B_cell', 'NK_cell', ...]  # Your class labels\nstrategy = ClassBalancedSampling(cell_types, total_size=8000)\ndataset = scDataset(data, strategy, batch_size=64)\n```\n\n### Multi-Modal Data with MultiIndexable\n\nHandle multiple related data modalities that need to be indexed together:\n\n```python\nfrom scdataset import MultiIndexable, Streaming\n\n# Group multiple data modalities\nmulti_data = MultiIndexable(\n    genes=gene_expression_data,    # Shape: (n_cells, n_genes)\n    proteins=protein_data,         # Shape: (n_cells, n_proteins)  \n    metadata=cell_metadata         # Shape: (n_cells, n_features)\n)\n\n# Use with any sampling strategy\nstrategy = Streaming()\ndataset = scDataset(multi_data, strategy, batch_size=64)\n\nfor batch in dataset:\n    genes = batch['genes']       # Gene expression for this batch\n    proteins = batch['proteins'] # Corresponding protein data\n    metadata = batch['metadata'] # Corresponding metadata\n```\n\n### Performance Optimization\n\nConfigure `fetch_factor` to fetch multiple batches worth of data at once:\n\n```python\nstrategy = BlockShuffling(block_size=16)\ndataset = scDataset(\n    data, \n    strategy, \n    batch_size=64,\n    fetch_factor=8  # Fetch 8*64=512 samples at once\n)\nloader = DataLoader(\n    dataset,\n    batch_size=None,\n    num_workers=4,\n    prefetch_factor=9  # fetch_factor + 1\n)\n```\n\nWe recommend setting `prefetch_factor` to `fetch_factor + 1` for efficient data loading. For parameter details, see the [original paper](https://arxiv.org/abs/2506.01883).\n\n### Custom Transforms and Callbacks\n\nApply custom transformations at fetch or batch time using the new callback system:\n\n#### Transform Overview\n\n- **`fetch_callback(collection, indices)`**:  \n  Customizes how samples are fetched from the underlying data collection.  \n  Use this if your collection does not support batched indexing or requires special access logic.  \n  - **Input:** the data collection and an array of indices  \n  - **Output:** the fetched data\n\n- **`fetch_transform(fetched_data)`**:  \n  Transforms each fetched chunk (e.g., sparse-to-dense conversion, normalization).  \n  - **Input:** the fetched data  \n  - **Output:** the transformed data\n\n- **`batch_callback(fetched_data, batch_indices)`**:  \n  Selects or arranges a minibatch from the fetched/transformed data.\n  - **Input:** the fetched/transformed data and a list of batch indices within the chunk\n  - **Output:** the batch to yield\n\n- **`batch_transform(batch)`**:  \n  Applies final processing to each batch before yielding (e.g., collation, augmentation).  \n  - **Input:** the batch  \n  - **Output:** the processed batch\n\n```python\nfrom scdataset import scDataset, Streaming\n\ndef fetch_transform(chunk):\n    # Example: convert sparse to dense, normalization, etc.\n    # Applied to entire fetched chunk\n    return chunk.toarray() if hasattr(chunk, 'toarray') else chunk\n\ndef batch_transform(batch):\n    # Example: batch-level augmentation or tensor conversion\n    import torch\n    return torch.from_numpy(batch).float()\n\nstrategy = Streaming()\ndataset = scDataset(\n    data,\n    strategy,\n    batch_size=64,\n    fetch_transform=fetch_transform,\n    batch_transform=batch_transform\n)\n```\n\n#### Complete Example with Multiple Strategies\n\n```python\nfrom scdataset import scDataset, BlockShuffling, Streaming\nfrom torch.utils.data import DataLoader\nimport numpy as np\n\n# Your data\ndata = my_data_collection\ntrain_indices = np.arange(0, 8000)\nval_indices = np.arange(8000, 10000)\n\n# Training with block shuffling\ntrain_strategy = BlockShuffling(block_size=32, indices=train_indices)\ntrain_dataset = scDataset(\n    data,\n    train_strategy,\n    batch_size=64,\n    fetch_factor=8\n)\n\ntrain_loader = DataLoader(\n    train_dataset,\n    batch_size=None,\n    num_workers=4,\n    prefetch_factor=9\n)\n\n# Validation with streaming (deterministic)\nval_strategy = Streaming(indices=val_indices)\nval_dataset = scDataset(\n    data,\n    val_strategy,\n    batch_size=64,\n    fetch_factor=8\n)\n\nval_loader = DataLoader(\n    val_dataset,\n    batch_size=None,\n    num_workers=4,\n    prefetch_factor=9\n)\n\n# Training loop\nfor epoch in range(num_epochs):\n    # Training\n    for batch in train_loader:\n        # Training code here\n        pass\n    \n    # Validation  \n    for batch in val_loader:\n        # Validation code here\n        pass\n```\n\n## Citing\n\nIf you use `scDataset` in your research, please cite the following paper:\n\n```bibtex\n@article{scdataset2025,\n  title={scDataset: Scalable Data Loading for Deep Learning on Large-Scale Single-Cell Omics},\n  author={D'Ascenzo, Davide and Cultrera di Montesano, Sebastiano},\n  journal={arXiv:2506.01883},\n  year={2025}\n}\n```\n\n## Migration from v0.1.x to v0.2.0\n\nscDataset v0.2.0 introduces breaking changes with a new strategy-based API. Here's how to migrate your code:\n\n### Old v0.1.x API\n```python\n# v0.1.x - No longer supported\nfrom scdataset import scDataset\n\ndataset = scDataset(data, batch_size=64, block_size=8, fetch_factor=4)\ndataset.subset(train_indices)\ndataset.set_mode('train')\n```\n\n### New v0.2.0 API\n```python\n# v0.2.0 - Strategy-based approach\nfrom scdataset import scDataset, BlockShuffling, Streaming\n\n# Training with shuffling\ntrain_strategy = BlockShuffling(block_size=8, indices=train_indices)\ntrain_dataset = scDataset(data, train_strategy, batch_size=64, fetch_factor=4)\n\n# Evaluation with streaming\nval_strategy = Streaming(indices=val_indices)\nval_dataset = scDataset(data, val_strategy, batch_size=64, fetch_factor=4)\n```\n\n**Key Changes:**\n- **Required strategy parameter**: Must provide a `SamplingStrategy` instance\n- **No more `subset()` and `set_mode()`**: Use strategy `indices` parameter and different strategy types\n- **Create separate datasets**: For different splits instead of modifying a single instance\n- **New import**: Import specific strategies like `Streaming`, `BlockShuffling`, etc.\n\n## License\n\nThis project is licensed under the MIT License.\n\n## Contributing\n\nContributions are welcome! Please open issues or pull requests on [GitHub](https://github.com/Kidara/scDataset).\n\n---\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fscdataset%2Fscdataset","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fscdataset%2Fscdataset","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fscdataset%2Fscdataset/lists"}