{"id":30838804,"url":"https://github.com/eleutherai/bergson","last_synced_at":"2026-04-10T06:24:27.917Z","repository":{"id":293438814,"uuid":"983405046","full_name":"EleutherAI/bergson","owner":"EleutherAI","description":"Mapping out the \"memory\" of neural nets with data attribution","archived":false,"fork":false,"pushed_at":"2025-09-02T22:57:11.000Z","size":4502,"stargazers_count":25,"open_issues_count":2,"forks_count":5,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-09-03T00:27:29.159Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/EleutherAI.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-05-14T10:34:17.000Z","updated_at":"2025-09-02T22:57:14.000Z","dependencies_parsed_at":null,"dependency_job_id":"cf02520e-3e36-4408-9752-f400d43da9fe","html_url":"https://github.com/EleutherAI/bergson","commit_stats":null,"previous_names":["eleutherai/quelle","eleutherai/bergson"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/EleutherAI/bergson","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/EleutherAI%2Fbergson","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/EleutherAI%2Fbergson/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/EleutherAI%2Fbergson/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/
repositories/EleutherAI%2Fbergson/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/EleutherAI","download_url":"https://codeload.github.com/EleutherAI/bergson/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/EleutherAI%2Fbergson/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":273941527,"owners_count":25195104,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-09-06T02:00:13.247Z","response_time":2576,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-09-06T18:12:21.640Z","updated_at":"2026-04-01T17:54:50.399Z","avatar_url":"https://github.com/EleutherAI.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Bergson\nThis library enables you to trace the memory of deep neural nets with gradient-based data attribution techniques. We currently focus on TrackStar, as described in [Scalable Influence and Fact Tracing for Large Language Model Pretraining](https://arxiv.org/abs/2410.17413v3) by Chang et al. 
(2024), [Magic](https://arxiv.org/abs/2504.16430), and also include support for several alternative influence functions.\n\nWe view attribution as a counterfactual question: **_If we \"unlearned\" this training sample, how would the model's behavior change?_** This formulation ties attribution to some notion of what it means to \"unlearn\" a training sample. Here we focus on a very simple notion of unlearning: taking a gradient _ascent_ step on the loss with respect to the training sample.\n\n## Core features\n\n- Gradient store for serial queries. We provide collection-time gradient compression for efficient storage, and integrate with FAISS for fast KNN search over large stores.\n- On-the-fly queries. Query gradients without disk I/O overhead via a single pass over a dataset with a set of precomputed query gradients.\n  - Experiment with multiple query strategies based on [LESS](https://arxiv.org/pdf/2402.04333).\n  - Ideal for compression-free gradients.\n- Per-token scores.\n- Train‑time gradient collection. Capture gradients produced during training with a ~17% performance overhead.\n- Scalable. We use [FSDP2](https://docs.pytorch.org/tutorials/intermediate/FSDP_tutorial.html), BitsAndBytes, and other performance optimizations to support large models, datasets, and clusters.\n- Integrated with HuggingFace Transformers and Datasets. We also support on-disk datasets in a variety of formats.\n- Structured gradient views and per-attention head gradient collection. Bergson enables mechanistic interpretability via easy access to per‑module or per-attention head gradients.\n\n# Announcements\n\n**March 2026**\n- Support MAGIC\n\n**February 2026**\n- Support per-token gradients\n\n**January 2026**\n- Support EK-FAC\n- [Experimental] Support distributing preconditioners across nodes and devices for VRAM-efficient computation through the GradientCollectorWithDistributedPreconditioners. If you would like this functionality exposed via the CLI please get in touch! 
https://github.com/EleutherAI/bergson/pull/100\n\n# Installation\n\n```bash\npip install bergson\n```\n\n# Quickstart\n\nTo construct an index of randomly projected gradients:\n\n```bash\nbergson build runs/index --model EleutherAI/pythia-14m --dataset NeelNanda/pile-10k --truncation --token_batch_size 4096\n```\n\nTo collect TrackStar attribution scores:\n\n```bash\nbergson trackstar runs/trackstar --model EleutherAI/pythia-14m --query.dataset NeelNanda/pile-10k --data.dataset NeelNanda/pile-10k --data.truncation --token_batch_size 4096 --query.truncation --query.split \"train[:20]\"\n```\n\nTo use MAGIC on a GPT-2 WikiText fine-tune:\n\n```bash\nbergson magic examples/magic/gpt2_wikitext_tiny.yaml\n```\n\n# Usage\n\nThere are two ways to use Bergson. The first is to write an index of dataset gradients to disk using `build`, then query it either programmatically with the `Attributor` or from the command line with `query`. The second is to specify your query upfront, then map over the dataset and collect and process gradients on the fly. With this second strategy, only influence scores are saved.\n\nYou can build an index of gradients for each training sample from the command line, using `bergson` as a CLI tool:\n\n```bash\nbergson build \u003coutput_path\u003e --model \u003cmodel_name\u003e --dataset \u003cdataset_name\u003e\n```\n\nThis will create a directory at `\u003coutput_path\u003e` containing the gradients for each training sample in the specified dataset. The `--model` and `--dataset` arguments should be compatible with the Hugging Face `transformers` library. By default it assumes that the dataset has a `text` column, but you can specify other columns using `--prompt_column` and optionally `--completion_column`. The `--help` flag will show you all available options.\n\nYou can also use the library programmatically to build an index. 
The `collect_gradients` function is just a bit lower level than the CLI tool, and allows you to specify the model and dataset directly as arguments. The result is a HuggingFace dataset which contains a handful of new columns, including `gradients`, which contains the gradients for each training sample. You can then use this dataset to compute attributions.\n\nAt the lowest level of abstraction, the `GradientCollector` context manager allows you to efficiently collect gradients for _each individual example_ in a batch during a backward pass, simultaneously randomly projecting the gradients to a lower-dimensional space to save memory. If you use Adafactor normalization, we do this in a compute-efficient way that avoids computing the full gradient for each example before projecting it to the lower dimension. There are two main ways you can use `GradientCollector`:\n\n1. Using a `closure` argument, which enables you to make use of the per-example gradients immediately after they are computed, during the backward pass. If you're computing summary statistics or other per-example metrics, this is the most efficient way to do it.\n2. Without a `closure` argument, in which case the gradients are collected and returned as a dictionary mapping module names to batches of gradients. 
This is the simplest and most flexible approach but is a bit more memory-intensive.\n\n## On-the-fly Query\n\nYou can score a large dataset against a previously built query index without saving its gradients to disk:\n\n```bash\nbergson score \u003coutput_path\u003e --model \u003cmodel_name\u003e --dataset \u003cdataset_name\u003e --query_path \u003cexisting_index_path\u003e --score individual --aggregation mean\n```\n\nWe provide a utility to reduce a dataset into its mean or sum query gradient, for use as a query index:\n\n```bash\nbergson reduce \u003coutput_path\u003e --model \u003cmodel_name\u003e --dataset \u003cdataset_name\u003e --aggregation mean --unit_normalize\n```\n\n## Index Query\n\nWe provide a query Attributor which supports unit normalized gradients and KNN search out of the box. Access it via CLI with\n\n```bash\nbergson query --index  \u003cindex_path\u003e --model \u003cmodel_name\u003e --unit_norm\n```\n\nor programmatically with\n\n```python\nfrom bergson import Attributor, FaissConfig\n\nattr = Attributor(args.index, device=\"cuda\")\n\n...\nquery_tokens = tokenizer(query, return_tensors=\"pt\").to(\"cuda:0\")[\"input_ids\"]\n\n# Query the index\nwith attr.trace(model.base_model, 5) as result:\n    model(query_tokens, labels=query_tokens).loss.backward()\n    model.zero_grad()\n```\n\nTo efficiently query on-disk indexes, perform ANN searches, and explore many other scalability features add a FAISS config:\n\n```python\nattr = Attributor(args.index, device=\"cuda\", faiss_cfg=FaissConfig(\"IVF1,SQfp16\", mmap_index=True))\n\nwith attr.trace(model.base_model, 5) as result:\n    model(query_tokens, labels=query_tokens).loss.backward()\n    model.zero_grad()\n```\n\n## Training Gradients\n\nGradient collection during training is supported via an integration with HuggingFace's Trainer and SFTTrainer classes. 
Training gradients are saved in the original order corresponding to their dataset items, and when the `track_order` flag is set the training steps associated with each training item are separately saved.\n\n```python\nfrom bergson import GradientCollectorCallback, prepare_for_gradient_collection\n\ncallback = GradientCollectorCallback(\n    path=\"runs/example\",\n    track_order=True,\n)\ntrainer = Trainer(\n    model=model,\n    args=training_args,\n    train_dataset=dataset,\n    eval_dataset=dataset,\n    callbacks=[callback],\n)\ntrainer = prepare_for_gradient_collection(trainer)\ntrainer.train()\n```\n\n## Attention Head Gradients\n\nBy default Bergson collects gradients for named parameter matrices, but per-attention head gradients may be collected by configuring an AttentionConfig for each module of interest.\n\n```python\nfrom bergson import AttentionConfig, IndexConfig, DataConfig\nfrom transformers import AutoModelForCausalLM\n\nmodel = AutoModelForCausalLM.from_pretrained(\"RonenEldan/TinyStories-1M\", trust_remote_code=True, use_safetensors=True)\n\ncollect_gradients(\n    model=model,\n    data=data,\n    processor=processor,\n    path=\"runs/split_attention\",\n    attention_cfgs={\n        # Head configuration for the TinyStories-1M transformer\n        \"h.0.attn.attention.out_proj\": AttentionConfig(num_heads=16, head_size=4, head_dim=2),\n    },\n)\n```\n\n## GRPO\n\nWhere a reward signal is available we compute gradients using a weighted advantage estimate based on Dr. GRPO:\n\n```bash\nbergson build \u003coutput_path\u003e --model \u003cmodel_name\u003e --dataset \u003cdataset_name\u003e --reward_column \u003creward_column_name\u003e\n```\n\n## Numerical Stability\n\nSome models produce inconsistent per-example gradients when batched together. 
This is caused by nondeterminism in optimized SDPA attention backends (flash, memory-efficient) — the diagnostic tests both padding-induced and equal-length batch divergence to pinpoint the source.\n\nUse the built-in diagnostic to check your model:\n\n```bash\nbergson test_model_configuration --model \u003cmodel_name\u003e\n```\n\nThis automatically tests escalating configurations and reports exactly which flags (if any) you need:\n\n```bash\n# If force_math_sdp alone is sufficient:\nbergson build \u003coutput_path\u003e --model \u003cmodel_name\u003e --force_math_sdp\n# If fp32 with TF32 matmuls is sufficient (cheaper than full fp32):\nbergson build \u003coutput_path\u003e --model \u003cmodel_name\u003e --precision fp32 --use_tf32_matmuls --force_math_sdp\n# If full fp32 precision is required:\nbergson build \u003coutput_path\u003e --model \u003cmodel_name\u003e --precision fp32 --force_math_sdp\n```\n\n### Performance impact\n\nBenchmarked on A100-80GB with 500 documents from pile-10k:\n\n| Model | Settings | Build time | vs bf16 baseline |\n|-------|----------|------------|------------------|\n| Pythia-160M | bf16 | 31.2s | — |\n| Pythia-160M | bf16 + `--force_math_sdp` | 31.0s | -0.7% |\n| Pythia-160M | fp32 + `--use_tf32_matmuls` | 26.6s | -14.7% |\n| Pythia-160M | fp32 + `--use_tf32_matmuls` + `--force_math_sdp` | 27.5s | -11.9% |\n| Pythia-160M | fp32 | 35.4s | +13.3% |\n| Pythia-160M | fp32 + `--force_math_sdp` | 40.6s | +29.9% |\n| OLMo-2-1B | bf16 | 45.5s | — |\n| OLMo-2-1B | bf16 + `--force_math_sdp` | 53.9s | +18.4% |\n| OLMo-2-1B | fp32 + `--use_tf32_matmuls` | 51.3s | +12.7% |\n| OLMo-2-1B | fp32 + `--use_tf32_matmuls` + `--force_math_sdp` | 54.0s | +18.8% |\n| OLMo-2-1B | fp32 | 131.8s | +189.8% |\n| OLMo-2-1B | fp32 + `--force_math_sdp` | 141.2s | +210.5% |\n\n`--use_tf32_matmuls` with fp32 precision is significantly cheaper than full fp32 and may be sufficient for many models.\n\nNot all models are affected — run `bergson test_model_configuration` 
before enabling these flags to avoid unnecessary overhead.\n\n# Benchmarks\n\n![CLI Benchmark](docs/benchmarks/cli_benchmark_NVIDIA_GH200_120GB.png)\n\nSee `benchmarks/` for scripts to reproduce and generate benchmarks on your own hardware.\n\n# Development\n\n```bash\npip install -e \".[dev]\"\npre-commit install\npytest\npyright\n```\n\nWe use [conventional commits](https://www.conventionalcommits.org/en/v1.0.0/) for releases.\n\n# Citation\n\nIf you found Bergson useful in your research, please cite us:\n\n```bibtex\n@software{bergson,\n  author       = {Lucia Quirke and Nora Belrose and Louis Jaburi and William Li and David Johnston and Michael Mulet and Guillaume Martres and Goncalo Paulo and Stella Biderman},\n  title        = {Bergson: Mapping out the \"memory\" of neural nets with data attribution},\n  year         = {2026},\n  publisher    = {Zenodo},\n  doi          = {10.5281/zenodo.18906967},\n  url          = {https://doi.org/10.5281/zenodo.18906967}\n}\n```\n\n# Support\n\nIf you have suggestions, questions, or would like to collaborate, please email lucia@eleuther.ai or drop us a line in the #data-attribution channel of the EleutherAI Discord!\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Feleutherai%2Fbergson","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Feleutherai%2Fbergson","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Feleutherai%2Fbergson/lists"}