{"id":21962512,"url":"https://github.com/CellArr/cellarr","last_synced_at":"2025-07-22T13:32:09.630Z","repository":{"id":242887000,"uuid":"810894591","full_name":"BiocPy/cellarr","owner":"BiocPy","description":"Store collections of experimental data based on TileDB","archived":false,"fork":false,"pushed_at":"2024-11-25T22:47:02.000Z","size":983,"stargazers_count":2,"open_issues_count":11,"forks_count":1,"subscribers_count":3,"default_branch":"master","last_synced_at":"2024-11-25T23:28:43.715Z","etag":null,"topics":["ml","tiledb"],"latest_commit_sha":null,"homepage":"https://biocpy.github.io/cellarr/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/BiocPy.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":"AUTHORS.md","dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-06-05T14:50:24.000Z","updated_at":"2024-11-22T22:52:43.000Z","dependencies_parsed_at":"2024-06-17T15:48:53.827Z","dependency_job_id":"fd0a9adc-d2a8-4c9c-b538-dd580fe954cf","html_url":"https://github.com/BiocPy/cellarr","commit_stats":null,"previous_names":["biocpy/cellarr"],"tags_count":16,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/BiocPy%2Fcellarr","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/BiocPy%2Fcellarr/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/BiocPy%2Fcellarr/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/BiocPy%2Fcellarr/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHu
b/owners/BiocPy","download_url":"https://codeload.github.com/BiocPy/cellarr/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":227101523,"owners_count":17731143,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ml","tiledb"],"created_at":"2024-11-29T10:42:41.579Z","updated_at":"2025-07-22T13:32:09.624Z","avatar_url":"https://github.com/BiocPy.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"[![PyPI-Server](https://img.shields.io/pypi/v/cellarr.svg)](https://pypi.org/project/cellarr/)\n![Unit tests](https://github.com/TileOme/cellarr/actions/workflows/run-tests.yml/badge.svg)\n\n# Cell Arrays\n\nCell Arrays is a Python package that provides a TileDB-backed store for large collections of genomic experimental data, such as millions of cells across multiple single-cell experiment objects.\n\nThe `CellArrDataset` is designed to store single-cell RNA-seq\ndatasets but can be generalized to store any 2-dimensional experimental data.\n\n\u003e [!NOTE]\n\u003e\n\u003e Check out the tutorial using cellXgene datasets [here](https://cellarr.github.io/cellarr/tutorial_cellxgene.html).\n\n\n## Install\n\nTo get started, install the package from [PyPI](https://pypi.org/project/cellarr/)\n\n```sh\npip install cellarr\n\n## to include optional dependencies\npip install cellarr[optional]\n```\n\n## Usage\n\n### Build a `CellArrDataset`\n\nBuilding a `CellArrDataset` generates 4 TileDB files in the specified output directory:\n\n- `gene_annotation`: A TileDB file containing feature/gene 
annotations.\n- `sample_metadata`: A TileDB file containing sample metadata.\n- `cell_metadata`: A TileDB file containing cell metadata including mapping to the samples\n  they are tagged with in `sample_metadata`.\n- An `assay` TileDB group containing various matrices. This allows the package to store multiple different matrices, e.g. 'counts', 'normalized', 'scaled' for the same sample/cell and gene attributes.\n\nThe organization is inspired by Bioconductor's `SummarizedExperiment` data structure.\n\nThe TileDB matrix file is stored in a **cell X gene** orientation. This orientation\nis chosen because the fastest-changing dimension as new files are added to the\ncollection is usually the cells rather than genes.\n\n![`CellArrDataset` structure](./assets/cellarr.png \"CellArrDataset\")\n\n**_Note: Currently only supports either paths to H5AD or `AnnData` objects_**\n\nTo build a `CellArrDataset` from a collection of `H5AD` or `AnnData` objects:\n\n```python\nimport anndata\nimport numpy as np\nimport tempfile\nfrom cellarr import build_cellarrdataset, CellArrDataset, MatrixOptions\n\n# Create a temporary directory, this is where the\n# output files are created. Pick your location here.\ntempdir = tempfile.mkdtemp()\n\n# Read AnnData objects\nadata1 = anndata.read_h5ad(\"path/to/object1.h5ad\", \"r\")\n# or just provide the path\nadata2 = \"path/to/object2.h5ad\"\n\n# Build CellArrDataset\ndataset = build_cellarrdataset(\n    output_path=tempdir,\n    files=[adata1, adata2],\n    matrix_options=MatrixOptions(matrix_name=\"counts\", dtype=np.int16),\n    num_threads=2,\n)\n\n# Or if the objects contain multiple assays\ndataset = build_cellarrdataset(\n    output_path=tempdir,\n    files=[adata1, adata2],\n    matrix_options=[\n        MatrixOptions(matrix_name=\"counts\", dtype=np.int16),\n        MatrixOptions(matrix_name=\"log-norm\", dtype=np.float32)\n    ],\n    num_threads=2,\n)\n```\n\nThe build process usually involves 4 steps:\n\n1. 
**Scan the Collection**: Scan the entire collection of files to create\n   a unique set of feature ids (e.g. gene symbols). Store this set as the\n   `gene_annotation` TileDB file.\n\n2. **Sample Metadata**: Store sample metadata in the `sample_metadata`\n   TileDB file. Each file is typically considered a sample, and an automatic\n   mapping is created between files and samples if metadata is not provided.\n\n3. **Store Cell Metadata**: Store cell metadata in the `cell_metadata`\n   TileDB file.\n\n4. **Remap and Orient Data**: For each dataset in the collection,\n   remap and orient the feature dimension using the feature set from Step 1.\n   This step ensures consistency in gene measurement and order, even if\n   some genes are unmeasured or ordered differently in the original experiments.\n\n**_Note: The objects used to build the `CellArrDataset` are expected to be fairly consistent, especially along the feature dimension.\nIf these are `AnnData` or `H5AD` objects, all objects must contain an index (in the `var` slot) specifying the gene symbols._**\n\n#### Optionally provide cell metadata columns\n\nIf the cell metadata is inconsistent across datasets, you may provide a list of\ncolumns to standardize during extraction. Any missing columns will be filled with\nthe default value `'NA'`, and their data type should be specified as `'ascii'` in\n`CellMetadataOptions`. For example, this build process will create a TileDB store\nfor cell metadata containing the columns `'cellids'` and `'tissue'`. 
If any dataset\nlacks one of these columns, the missing values will be automatically filled with `'NA'`.\n\n```python\ndataset = build_cellarrdataset(\n    output_path=tempdir,\n    files=[adata1, adata2],\n    matrix_options=MatrixOptions(dtype=np.float32),\n    cell_metadata_options=CellMetadataOptions(\n        column_types={\"cellids\": \"ascii\", \"tissue\": \"ascii\"}\n    ),\n)\n\nprint(dataset)\n```\n\nCheck out the [documentation](https://cellarr.github.io/cellarr/tutorial.html) for more details.\n\n### Building on HPC environments with `slurm`\n\nTo build TileDB files on HPC environments that use `slurm`, follow these steps.\n\n- Step 1: Construct a manifest file\n  A minimal manifest file (JSON) must contain the following fields:\n- `\"files\"`: A list of file paths to the input `h5ad` objects.\n- `\"python_env\"`: A set of commands to activate the Python environment containing this package and its dependencies.\n\nHere’s an example of the manifest file:\n\n```py\nimport json\n\nmanifest = {\n    # Placeholder paths: replace with your list of input files\n    \"files\": [\"path/to/object1.h5ad\", \"path/to/object2.h5ad\"],\n    \"python_env\": \"\"\"\nml Miniforge3\nconda activate cellarr\n\npython --version\nwhich python\n    \"\"\",\n    \"matrix_options\": [\n        {\n            \"matrix_name\": \"non_zero_cells\",\n            \"dtype\": \"uint32\"\n        },\n        {\n            \"matrix_name\": \"pseudo_bulk_log_normed\",\n            \"dtype\": \"float32\"\n        }\n    ],\n}\n\n# Write the manifest to disk\nwith open(\"your/path/to/manifest.json\", \"w\") as f:\n    json.dump(manifest, f)\n```\n\nFor more options, check out the [README](./src/cellarr/slurm/README.md).\n\n- Step 2: Submit the job\n  Once your manifest file is ready, you can submit the necessary jobs using the `cellarr_build` CLI. 
Run the following command:\n\n```sh\ncellarr_build --input-manifest your/path/to/manifest.json --output-dir your/path/to/output --memory-per-job 8 --cpus-per-task 2\n```\n\n### Query a `CellArrDataset`\n\nUsers can either reuse the `dataset` object returned when building the dataset or create a `CellArrDataset` object by pointing it to the path where the files were created.\n\n```python\n# Create a CellArrDataset object from the existing dataset\ndataset = CellArrDataset(dataset_path=tempdir)\n\n# Query data from the dataset\ngene_list = [\"gene_1\", \"gene_95\", \"gene_50\"]\nexpression_data = dataset[0:10, gene_list]\n\nprint(expression_data.matrix)\n\nprint(expression_data.gene_annotation)\n```\n\n     ## output 1\n     \u003c11x3 sparse matrix of type '\u003cclass 'numpy.float32'\u003e'\n          with 9 stored elements in COOrdinate format\u003e\n\n     ## output 2\n     \tcellarr_gene_index\n     0\tgene_1\n     446\tgene_50\n     945\tgene_95\n\nThe query returns a `CellArrDatasetSlice` object that contains the matrix and the metadata `DataFrame`s along the cell and gene axes.\n\nUsers can convert these to analysis-ready representations:\n\n```python\nprint(\"as anndata:\")\nprint(expression_data.to_anndata())\n\nprint(\"\\n\\n as summarizedexperiment:\")\nprint(expression_data.to_summarizedexperiment())\n```\n\n### A built-in dataloader for the `pytorch-lightning` framework\n\nThe package includes a dataloader in the `pytorch-lightning` framework for single-cell expression profiles, training labels, and study labels. 
The dataloader uniformly samples across training labels and study labels to create a diverse batch of cells.\n\nThis dataloader can be used as a **template** to create custom dataloaders specific to your needs.\n\n```python\nfrom cellarr.ml.dataloader import DataModule\n\ndatamodule = DataModule(\n    dataset_path=\"/path/to/cellar/dir\",\n    cell_metadata_uri=\"cell_metadata\",\n    gene_annotation_uri=\"gene_annotation\",\n    matrix_uri=\"assays/counts\",\n    label_column_name=\"label\",\n    study_column_name=\"study\",\n    batch_size=1000,\n    lognorm=True,\n    target_sum=1e4,\n)\n```\n\nThe package also includes a simple autoencoder built with the `pytorch-lightning` framework that makes use of the dataloader. This can be used as a template to create custom architectures and models.\n\n```python\nimport pytorch_lightning as pl\nfrom cellarr.ml.autoencoder import AutoEncoder\n\nautoencoder = AutoEncoder(\n    n_genes=len(datamodule.gene_indices),\n    latent_dim=128,\n    hidden_dim=[1024, 1024, 1024],\n    dropout=0.5,\n    input_dropout=0.4,\n    residual=False,\n)\n\nmodel_path = \"/path/to/model/mymodel/\"\nparams = {\n    \"max_epochs\": 500,\n    \"logger\": True,\n    \"log_every_n_steps\": 1,\n    \"limit_train_batches\": 100, # to specify number of batches per epoch\n}\ntrainer = pl.Trainer(**params)\ntrainer.fit(autoencoder, datamodule=datamodule)\nautoencoder.save_all(model_path=model_path)\n```\n\nCheck out the [documentation](https://cellarr.github.io/cellarr/api/modules.html) for more details.\n\n\u003c!-- pyscaffold-notes --\u003e\n\n## Note\n\nThis project has been set up using PyScaffold 4.5. For details and usage\ninformation on PyScaffold see \u003chttps://pyscaffold.org/\u003e.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FCellArr%2Fcellarr","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FCellArr%2Fcellarr","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FCellArr%2Fcellarr/lists"}