{"id":44816874,"url":"https://github.com/gcol33/resolve","last_synced_at":"2026-06-12T23:01:08.138Z","repository":{"id":333400394,"uuid":"1137154818","full_name":"gcol33/resolve","owner":"gcol33","description":"Neural network framework for species distribution modelling (PyTorch/C++/CUDA)","archived":false,"fork":false,"pushed_at":"2026-05-29T21:59:54.000Z","size":67411,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"master","last_synced_at":"2026-05-29T23:20:45.571Z","etag":null,"topics":["cpp","cuda","deep-learning","ecology","machine-learning","neural-network","pytorch","species-distribution"],"latest_commit_sha":null,"homepage":null,"language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/gcol33.png","metadata":{"files":{"readme":"README.md","changelog":"NEWS.md","contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE.md","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":null,"security":"SECURITY.md","support":null,"governance":null,"roadmap":"ROADMAP.md","authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-01-19T01:49:18.000Z","updated_at":"2026-05-29T21:59:59.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/gcol33/resolve","commit_stats":null,"previous_names":["gcol33/spacc","gcol33/resolve"],"tags_count":5,"template":false,"template_full_name":null,"purl":"pkg:github/gcol33/resolve","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gcol33%2Fresolve","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gcol33%2Fresolve/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gcol33%2Fresolve/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gcol33%2Fresolve/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/gcol33","download_url":"https://codeload.github.com/gcol33/resolve/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gcol33%2Fresolve/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34265491,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-12T02:00:06.859Z","response_time":109,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cpp","cuda","deep-learning","ecology","machine-learning","neural-network","pytorch","species-distribution"],"created_at":"2026-02-16T18:37:02.840Z","updated_at":"2026-06-12T23:01:08.129Z","avatar_url":"https://github.com/gcol33.png","language":"C++","funding_links":[],"categories":[],"sub_categories":[],"readme":"# RESOLVE\n\n[![Tests](https://github.com/gcol33/resolve/actions/workflows/tests.yml/badge.svg)](https://github.com/gcol33/resolve/actions/workflows/tests.yml)\n[![Documentation](https://img.shields.io/badge/docs-online-blue)](https://gillescolling.com/resolve)\n[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)\n[![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/)\n\n**Representation Encoding for Structured Observation Learning with Vector Embeddings**\n\nA torch-based package for predicting sample attributes from compositional data—sets of entities with optional abundances or weights.\n\n## Overview\n\nRESOLVE treats compositional data as *contextual signal*—a rich, structured representation that encodes information about sample-level attributes. Given a set of entities (species in a plot, symptoms in a patient, products in a basket), RESOLVE learns to predict properties of the sample.\n\n**Core idea**: Compositional data encodes a shared latent representation that simultaneously informs multiple sample attributes.\n\n### Example Domains\n\n| Domain | Entities | Sample | Predictions |\n|--------|----------|--------|-------------|\n| **Ecology** | Plant species | Vegetation plot | Plot area, habitat type, elevation |\n| **Medicine** | Symptoms, conditions | Patient | Diagnosis, severity, treatment response |\n| **Retail** | Products | Shopping basket | Customer segment, churn risk |\n| **Genomics** | Genes, variants | Sample | Phenotype, disease risk |\n| **Text** | Words, n-grams | Document | Topic, sentiment, author |\n\n## Quick Start\n\n```python\nfrom resolve import ResolveDataset, Trainer\n\n# Load data\ndataset = ResolveDataset.from_csv(\n    header=\"samples.csv\",       # one row per sample\n    species=\"entities.csv\",     # entity-sample associations\n    roles={\n        \"plot_id\": \"sample_id\",\n        \"species_id\": \"entity_id\",\n        \"species_plot_id\": \"sample_id\",\n    },\n    targets={\"y\": {\"column\": \"response\", \"task\": \"regression\"}},\n)\n\n# Train\ntrainer = Trainer(dataset)\ntrainer.fit()\n\n# Predict with confidence filtering\npreds = trainer.predict(dataset)\npreds = trainer.predict(dataset, confidence_threshold=0.8)\n```\n\n### Ecology Example\n\nPredict vegetation plot area and habitat from species composition:\n\n```python\nfrom resolve import ResolveDataset, Trainer, RoleMapping, TargetConfig, TrainerConfig\n\ndataset = ResolveDataset.from_csv(\n    header=\"plots.csv\",\n    species=\"species_records.csv\",\n    roles=RoleMapping(\n        plot_id=\"PlotID\",\n        species_id=\"Species\",\n        species_plot_id=\"PlotID\",\n        abundance=\"Cover\",\n        taxonomy_genus=\"Genus\",\n        taxonomy_family=\"Family\",\n        coords_lat=\"Latitude\",\n        coords_lon=\"Longitude\",\n    ),\n    targets={\n        \"area\": TargetConfig(column=\"Area\", task=\"regression\", transform=\"log1p\"),\n        \"habitat\": TargetConfig(column=\"Habitat\", task=\"classification\", num_classes=5),\n    },\n)\n\nconfig = TrainerConfig(hash_dim=64, top_k=10, hidden_dims=[512, 256, 128])\ntrainer = Trainer(**config.to_trainer_kwargs(dataset))\ntrainer.fit()\n```\n\n### Medical Example\n\nPredict diagnosis from patient symptoms:\n\n```python\ndataset = ResolveDataset.from_csv(\n    header=\"patients.csv\",\n    species=\"symptoms.csv\",\n    roles=RoleMapping(\n        plot_id=\"patient_id\",\n        species_id=\"symptom_code\",\n        species_plot_id=\"patient_id\",\n        abundance=\"severity\",  # optional: symptom intensity\n    ),\n    targets={\n        \"diagnosis\": TargetConfig(column=\"icd_code\", task=\"classification\", num_classes=50),\n        \"severity\": TargetConfig(column=\"severity_score\", task=\"regression\"),\n    },\n)\n```\n\n## Features\n\n| Feature | Description |\n|---------|-------------|\n| **Hybrid entity encoding** | Feature hashing for full entity lists + learned embeddings for dominant entities |\n| **Multi-target prediction** | Single shared encoder, multiple task heads (regression \u0026 classification) |\n| **Phased training** | MAE → SMAPE → band accuracy optimization |\n| **Semantic role mapping** | Flexible column naming via `RoleMapping` dataclass |\n| **Unknown entity tracking** | Detects and quantifies novel entities at inference time |\n| **Abundance normalization** | Raw, normalized (sum-to-one), or log1p modes |\n| **Confidence filtering** | Set threshold to filter uncertain predictions |\n| **Typed configuration** | `TrainerConfig` dataclass with presets (TINY_MODEL → MAX_MODEL) |\n| **CPU-first** | Works without GPU, scales with CUDA when available |\n\n## Performance\n\nOptimized CUDA kernels for GPU acceleration. Benchmarks on RTX 4090:\n\n| Operation | Dataset Size | CPU | GPU | Speedup |\n|-----------|-------------|-----|-----|---------|\n| Hash Embedding | 10K records | 0.08 ms | 0.02 ms | 5x |\n| Hash Embedding | 100K records | 1.3 ms | 0.04 ms | **35x** |\n| Hash Embedding | 1M records | 32 ms | 0.08 ms | **400x** |\n\n## Installation\n\n```bash\npip install resolve\n```\n\nOr from source:\n\n```bash\ngit clone https://github.com/gcol33/resolve.git\ncd resolve\npip install -e .\n```\n\n## Architecture\n\n```\nEntity data ──────┐\n                  ├──→ EntityEncoder ──→ hash embedding + hierarchy IDs\nCoordinates ──────┤                      + unknown mass features\n                  ├──→ SampleEncoder (shared) ──→ latent representation\nCovariates ───────┘\n                                                      │\n                                    ┌─────────────────┼─────────────────┐\n                                    ↓                 ↓                 ↓\n                              TaskHead(y1)     TaskHead(y2)     TaskHead(y3)\n                                    │                 │                 │\n                                    ↓                 ↓                 ↓\n                              regression       regression       classification\n```\n\n### Linear Compositional Pooling\n\nEntity effects are aggregated linearly (abundance-weighted sum) before nonlinear mixing in the encoder. This preserves interpretability: each entity contributes additively to the latent signal before the network learns complex interactions.\n\n## Configuration\n\nUse `TrainerConfig` for clean, reusable training setups:\n\n```python\nfrom resolve import TrainerConfig\n\n# Custom config\nconfig = TrainerConfig(\n    hash_dim=128,\n    top_k=20,\n    hidden_dims=[1024, 512, 256],\n    max_epochs=500,\n    patience=30,\n)\n\n# Or use presets\nfrom resolve.config import LARGE_MODEL, MEDIUM_MODEL\ntrainer = Trainer(**LARGE_MODEL.to_trainer_kwargs(dataset))\n```\n\n## Limiting GPU VRAM usage\n\nRESOLVE leaves the PyTorch CUDA caching allocator uncapped by default\n(`vram_fraction = 1.0`) so dedicated training jobs on a solo GPU use the full\ndevice. Pass an explicit lower value when sharing the GPU with a desktop or\nother workloads — the Windows WDDM driver spills allocations beyond physical\nVRAM into shared system memory, which freezes the whole desktop under load,\nso leaving ~20% headroom keeps the compositor, browser, and other GPU\nclients responsive while training runs.\n\n```python\nfrom resolve_core import TrainConfig\n\ncfg = TrainConfig()\ncfg.vram_fraction = 1.0   # default — dedicated training job on solo GPU\n\n# Sharing the GPU with a desktop / GUI: leave headroom\ncfg.vram_fraction = 0.80\n```\n\nCLI:\n\n```bash\nresolve train --vram-fraction 1.0 ...    # default (dedicated)\nresolve predict --vram-fraction 0.80 ... # shared with desktop\n```\n\nR:\n\n```r\ntrainer \u003c- Trainer$new(model, list(\n    batch_size = 4096L,\n    device = \"cuda\",\n    vram_fraction = 1.0  # default\n))\n```\n\nThe cap applies to both `Trainer` and `Predictor::load`. To apply it\nindependently of either (e.g., before loading any model):\n\n```python\nimport resolve_core\nresolve_core.set_vram_fraction(0.80)  # affects current CUDA device\n```\n\n`Predictor.load` defaults to `device=\"cpu\"` and `predict_dataset` chunks\nits forward pass at `batch_size = 4096` along dim 0, with results\nconcatenated on CPU. Pass `batch_size=-1` to opt back into the legacy\none-shot path (only safe when the whole test set fits on the device).\n\n### Auto-halve `batch_size` on OOM\n\n`Trainer.fit()` catches `c10::OutOfMemoryError`, releases optimizer / AMP /\nGPU caches, halves `batch_size`, and restarts training from epoch 0 against\nthe original model weights. The retry stops at `batch_size_floor` (default\n1024); below the floor the OOM rethrows. After `fit()` returns,\n`trainer.config.batch_size` is the effective batch size that actually\ntrained the model, also persisted in the checkpoint.\n\n```python\ncfg.batch_size = 16384\ncfg.batch_size_floor = 1024\n```\n\nCLI: `resolve train --batch-size 16384 --batch-size-floor 1024`.\n\n### CUDA allocator config (Linux vs Windows)\n\n`PYTORCH_CUDA_ALLOC_CONF` is set automatically at `import resolve_core` to\na platform-aware default:\n`expandable_segments:True,garbage_collection_threshold:0.8,max_split_size_mb:256`\non Linux/mac, and the same without `expandable_segments` on Windows (the\ncuMemMap-backed allocator is not implemented on win32; libtorch warns\notherwise). To overwrite an existing value or simply log what is active:\n`resolve_core.configure_cuda_allocator(force=True)`.\n\n## Documentation\n\n- **[Getting Started](https://gillescolling.com/resolve/tutorials/quickstart/)**: Complete workflow walkthrough\n- **[Data Preparation](https://gillescolling.com/resolve/tutorials/data-preparation/)**: Data formatting guide\n- **[Training](https://gillescolling.com/resolve/tutorials/training/)**: Advanced training options\n- **[API Reference](https://gillescolling.com/resolve/api/dataset/)**: Full API documentation\n\n## Requirements\n\n- Python ≥ 3.10\n- PyTorch ≥ 2.0\n- pandas ≥ 2.0\n- scikit-learn ≥ 1.3\n\n## License\n\nMIT License - see [LICENSE.md](LICENSE.md) for details.\n\n## Citation\n\nIf you use RESOLVE in your research, please cite:\n\n```bibtex\n@software{resolve,\n  author = {Colling, Gilles},\n  title = {RESOLVE: Representation Encoding for Structured Observation Learning with Vector Embeddings},\n  year = {2025},\n  url = {https://github.com/gcol33/resolve}\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgcol33%2Fresolve","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgcol33%2Fresolve","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgcol33%2Fresolve/lists"}