{"id":24143814,"url":"https://github.com/dnbaker/scavenger","last_synced_at":"2026-05-13T17:38:04.117Z","repository":{"id":235939814,"uuid":"629743627","full_name":"dnbaker/scavenger","owner":"dnbaker","description":"Rust spatial/single-cell genomics","archived":false,"fork":false,"pushed_at":"2024-04-25T00:23:34.000Z","size":209,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-03-01T14:32:55.646Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/dnbaker.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":"Roadmap.md","authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-04-19T00:05:06.000Z","updated_at":"2024-04-25T07:31:33.000Z","dependencies_parsed_at":null,"dependency_job_id":"2f4e69b5-5dac-41cb-bb7e-d3b0c988732e","html_url":"https://github.com/dnbaker/scavenger","commit_stats":null,"previous_names":["dnbaker/scavenger"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/dnbaker/scavenger","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dnbaker%2Fscavenger","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dnbaker%2Fscavenger/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dnbaker%2Fscavenger/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dnbaker%2Fscavenger/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/dnbaker","download_url":"https://codeload.github.com/d
nbaker/scavenger/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dnbaker%2Fscavenger/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286020955,"owners_count":27272089,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-11-23T02:00:06.149Z","response_time":135,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-01-12T05:45:34.822Z","updated_at":"2025-11-23T21:02:13.622Z","avatar_url":"https://github.com/dnbaker.png","language":"Rust","funding_links":[],"categories":[],"sub_categories":[],"readme":"This is a playground testing some basic VAE ideas applied to sparse count data, including a full-covariance latent space.\n\nNone of it is ideal for production use, and it is not integrated with scvi-hub or similar; however, the code works and can serve as building blocks.\n\nIt is implemented both in Python via PyTorch and in Rust via libtorch (through tch-rs). The Rust version lets applications run without the Python runtime.\n\n## Contents\n\n1. VAE with negative binomial likelihood model with diagonal or full covariance latent model.\n2. Batch correction:\n  1. One-hot encoded class data helps correct for batch effects in reconstruction.\n  2. Allowance of softmax/uncertain classification allows semi-supervised learning.\n  3. 
Optional classifier loss helps the model distinguish batches as well, giving us two locations for batch correction in the model.\n3. Use of equivariant GNN for spatially-resolved patches. The aim is to use VAE-based embeddings in the spatial model.\n  1. Code in `scavenger/experimental/equiformer.py`.\n\n\n## Quick-start\n\nFor count-based data, see `train.py` for an example.\n\n`scavenger.NBVAE` is the class you'll want to work with. Use `full_cov={True, False}` to choose between the diagonal covariance and full covariance matrix options.\n\nThe diagonal covariance is simpler in latent space and may pull out more independent information. But the full covariance model has higher capacity and yields a much higher likelihood on real data.\n\n```python3\nfrom scavenger import NBVAE\n\ndata_dim = 32768 # Your number\nlatent_dim = 128\nhidden_dim = [512, 256]\nmodel = NBVAE(data_dim, latent_dim=latent_dim, hidden_dim=hidden_dim, full_cov=True)\n\n\n# Let data = batch (N, Dim)\npacked_output = model(data)\nlatent, losses, zinb, class_info = packed_output\n# zinb is the model for the data provided, latent is the latent representations\n# latent_repr = (N, LatentDim)\n# sampled_repr = (N, LatentDim) - latent + Gaussian noise\n# nb_model: scavenger.ZINB model.\n# logvar: log variance\n# This is always used.\n\n# full_cov: (N, LatentDim * LatentDim) - expanded covariance matrix.\n# This is None if full covariance is not enabled.\n\n# The diagonal of this matrix is the exponential of logvar\nlatent_repr, sampled_repr, nb_model, logvar, full_cov = latent\n\n# Model loss: KL divergence of reparameterization\n# Reconstruction loss: negative log-likelihood of model reconstruction.\nmodel_loss, reconstruction_loss = losses[:2]\n\n\n# If classification labels were provided, class_info will have (`class_logits`, `class_loss`).\n# Otherwise, it will be None.\n# You can use it to see how clearly the sample belonged to a particular group.\n# And you can backpropagate from `class_loss` to teach the model 
to reconstruct categorical labels as well.\n\n```\n\n\n### Batch correction/integration\nFor batch correction, use `categorical_class_sizes=[num_batches]` when constructing NBVAE, and add a one-hot encoded label.\nIf you have additional categorical labels (spatial data, ATAC-seq, microarray/short/long read, library prep), add them to the list, too.\n\nFor instance `categorical_class_sizes=[num_batches, num_experiment_types, num_library_preps]`.\n\nThen, when calling forward on the model, pass the labels as the second argument.\n\nFor example:\n\n```python3\nimport torch\nfrom scavenger import NBVAE\n\ndata_dim = 32768 # Your number\nlatent_dim = 128\nhidden_dim = [512, 256]\n# Two batches (dataset1 and dataset2), so categorical_class_sizes=[2]\nmodel = NBVAE(data_dim, latent_dim=latent_dim, hidden_dim=hidden_dim, full_cov=True, categorical_class_sizes=[2])\n\ndataset1, dataset2 = two_different_datasets()\nshapes = [x.shape[0] for x in (dataset1, dataset2)]\n\n# Use union or intersection for genes to get the same feature-set if necessary.\n# Assume both datasets have the same features and are in row-major format.\n\nmerged_dataset = torch.vstack([dataset1, dataset2])\n\nmerged_labels = torch.vstack([torch.zeros(x, dtype=torch.long).reshape(-1, 1) + xi for xi, x in enumerate(shapes)]).to(merged_dataset.dtype)\n\n# Now you can train:\n# Get batches of data + labels\n\nidx = sampled_set()\ndata = merged_dataset[idx]\nlabels = merged_labels[idx]\n\n# labels can be one-hot or logits\n# One-hot items (which are torch.long dtype) are treated as logits * 20, so 20,000 more likely to be the provided class.\n# You can raise or lower this ratio with `temp=` for the forward call.\n# Logits are used directly otherwise.\npacked_output = model(data, labels)\n\n# To pass soft labels instead, supply logits here, e.g. labels = label_logits\n# Get a tuple out\nlatent, losses, zinb, class_info = packed_output\n# Or a dictionary, which is easier to reason with.\nlabeled_output = model.labeled_unpack(packed_output)\nlatent_repr, sampled_repr, nb_model, logvar, full_cov = latent\nmodel_loss, reconstruction_loss = losses[:2]\nclass_logits, class_loss = class_info\ntotal_loss = model_loss.sum() + reconstruction_loss.sum() + 
class_loss.sum()\ntotal_loss.backward()\n```\n\nBy having the model learn the classes, it can try to distinguish batches/effect types.\n\nIf you don't provide the class labels, the model will still generate logits for categorical labels, but it will estimate them from count data alone. This gives it a light semi-supervised approach.\n\nI aim to test this using bulk RNA-seq expression atlases (e.g., GTEx) as a normal background, leveraged for single-cell analysis.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdnbaker%2Fscavenger","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdnbaker%2Fscavenger","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdnbaker%2Fscavenger/lists"}