{"id":26692731,"url":"https://github.com/ilyalasy/memorization_circuits","last_synced_at":"2026-06-30T16:31:12.496Z","repository":{"id":284583937,"uuid":"955406748","full_name":"ilyalasy/memorization_circuits","owner":"ilyalasy","description":"Applied mechanistic interpretability techniques to find circuits behind memorization processes in GPT-NEO-125m","archived":false,"fork":false,"pushed_at":"2025-06-17T15:45:12.000Z","size":2237,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-10-16T08:04:32.957Z","etag":null,"topics":["circuits","counterfactual","mechanistic-interpretability","memorization"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ilyalasy.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-03-26T15:33:50.000Z","updated_at":"2025-07-15T11:29:11.000Z","dependencies_parsed_at":"2025-10-16T05:19:11.243Z","dependency_job_id":"a05e8be1-36ad-45e1-824c-017700502e17","html_url":"https://github.com/ilyalasy/memorization_circuits","commit_stats":null,"previous_names":["ilyalasy/memorization_circuits"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/ilyalasy/memorization_circuits","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ilyalasy%2Fmemorization_circuits","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ily
alasy%2Fmemorization_circuits/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ilyalasy%2Fmemorization_circuits/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ilyalasy%2Fmemorization_circuits/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ilyalasy","download_url":"https://codeload.github.com/ilyalasy/memorization_circuits/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ilyalasy%2Fmemorization_circuits/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34975668,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-30T02:00:05.919Z","response_time":92,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["circuits","counterfactual","mechanistic-interpretability","memorization"],"created_at":"2025-03-26T17:34:46.897Z","updated_at":"2026-06-30T16:31:12.478Z","avatar_url":"https://github.com/ilyalasy.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Understanding Verbatim Memorization in LLMs Through Circuit Discovery\n\nThis repository implements a pipeline for discovering circuits responsible for verbatim memorization in large language models. 
The pipeline consists of four main stages: dataset (The Pile) collection, contrastive dataset creation, circuit discovery, and circuit verification.\n\n## Pipeline Overview\n\nThe complete pipeline can be executed using `run_pipeline.sh`, which orchestrates the following stages:\n\n### 1. Memorization Score Calculation (`memorization_score.py`)\n\nDownloads a specified dataset and calculates memorization scores for each sample by:\n- Using the first `n` tokens as context prompts\n- Generating `y` tokens with the model\n- Computing exact token match scores between generated and ground truth completions\n- Saving contexts, completions, and memorization scores to JSON format\n\n**Usage:**\n```bash\npython memorization_score.py \\\n    --model_name \"EleutherAI/gpt-neo-125m\" \\\n    --prompt_tokens 50 \\\n    --generation_tokens 50 \\\n    --dataset \"timaeus/pile-wikipedia_en\"\n```\n\nNote: this works directly with a preprocessed Hugging Face dataset. Alternatively, you can first download subsets of the Pile with [download_pile_subset.sh](download_pile_subset.sh) and then run [`memorization_score.py`](memorization_score.py) on the downloaded path.\n\n### 2. Contrastive Dataset Creation (`contrastive_dataset.py`)\n\nCreates contrastive datasets for circuit analysis using two approaches:\n\n#### Branch Decision (`--contrastive_mode divergence`)\nThis approach focuses on the precise moment memorization breaks down:\n\n1. **Divergence point detection**: For each memorized sample, the algorithm progressively shortens the context until there's a significant relative drop (\u003e30% by default) in the BLEU-4 score compared to the previous context length, AND the model's next token differs from ground truth\n2. **Clean examples**: Original memorized context truncated to the divergence point + correct next token\n3. **Corrupt examples**: Same truncated context + model's predicted (incorrect) token\n4. 
**Contrastive pair format**: `(context + correct_token, context + wrong_token) → (next_correct_token, next_wrong_token)`\n- **Purpose**: Understanding the moment where the model 'decides' to memorize vs. generate novel content\n\n\n#### Memorization Decision (`--contrastive_mode dataset`)\nThis approach contrasts memorized vs. non-memorized content, with enhanced precision when divergence data is available:\n\n**Step 1 - Load Branch Decision (optional)**: Optionally loads the results of a `--contrastive_mode divergence` run\n\n**Step 2 - Find contrastive pairs**:\n- **With divergence data**: Finds low-memorization samples that have the same token at the divergence position as the high-memorization sample, then verifies the model would predict that same token, ensuring the contrast is at the exact decision point\n- **Without divergence data**: Uses model embeddings or token overlap to find semantically similar pairs between high and low memorization samples\n- **Similarity calculation**: Uses cosine similarity of model embeddings by default\n\n**Contrastive pair format**: `(memorized_context, non_memorized_context) → (model_prediction, correct_answer)`\n**Purpose**: Understanding what distinguishes memorizable from non-memorizable content at the neural level\n\n\n**Usage:**\n```bash\npython contrastive_dataset.py \\\n    --dataset \"timaeus/pile-wikipedia_en\" \\\n    --model_name \"EleutherAI/gpt-neo-125m\" \\\n    --threshold 0.75 \\\n    --contrastive_mode \"dataset\"  # or \"divergence\"\n```\n\n### 3. Circuit Discovery (`find_circuits.py`)\n\nUses the [AutoCircuit library](https://ufo-101.github.io/auto-circuit/) to discover minimal neural circuits responsible for memorization behavior:\n\n1. **Edge Attribution**: Applies EAP-IG (Edge Attribution Patching with Integrated Gradients) to compute importance scores for each model edge\n2. 
**Binary Search**: Finds the minimal set of edges that maintains target performance (default: 85% of baseline)\n\n**Key Parameters:**\n- `--grad_function`: Function applied to logits before gradient computation (`logit`, `prob`, `logprob`)\n- `--loss_function`: Optimization target (`avg_diff`, `avg_val_wrong`, etc.)\n- `--optimize_metric`: Performance metric for circuit search (`logit_diff`, `answer_logit`, etc.)\n\n**Usage:**\n```bash\npython find_circuits.py \\\n    --model_name \"EleutherAI/gpt-neo-125m\" \\\n    --path \"data/results/contrastive_dataset.json\" \\\n    --grad_function \"logit\" \\\n    --loss_function \"avg_val_wrong\"\n```\n\n[`find_circuits_eap.py`](find_circuits_eap.py) contains an attempt to use the [original EAP-IG repo by Hanna et al.](https://github.com/hannamw/EAP-IG), but AutoCircuit patching ended up being much faster.\n\n### 4. Circuit Verification (`verify_circuit.py`)\n\nValidates discovered circuits by:\n- Loading pre-computed prune scores and applying specified edge counts\n- Evaluating circuit performance on test datasets using [defined metrics](find_circuits.py#L93)\n- Comparing against circuits containing random edges\n- Computing faithfulness scores relative to full model performance\n\n**Usage:**\n```bash\npython verify_circuit.py \\\n    --prune_scores_path \"data/circuits/prune_scores.pkl\" \\\n    --edge_count 50 \\\n    --dataset_path \"data/results/test_dataset.json\"\n```\n\n## Verification Scripts\n\nThe `verify_scripts/` directory contains reproduction scripts for various experimental configurations:\n\n- `verify_mem_decision_*.sh`: Memorization decision experiments\n- `verify_branch_*.sh`: Branch decision experiments\n- `verify_ablations_*.sh`: Experiments with different ablation methods\n- `verify_*_random.sh`: Random baseline comparisons\n\n## Requirements\n\nSee [`requirements.txt`](requirements.txt):\n- PyTorch\n- Transformers\n- AutoCircuit ([my fork](https://github.com/ilyalasy/auto-circuit/tree/tokenization) that 
fixes a couple of bugs)\n- EAP (optional) ([my fork](https://github.com/ilyalasy/EAP-IG) with some changes needed to run my experiments)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Filyalasy%2Fmemorization_circuits","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Filyalasy%2Fmemorization_circuits","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Filyalasy%2Fmemorization_circuits/lists"}