{"id":50719111,"url":"https://github.com/syvb/cocoracle","last_synced_at":"2026-06-09T22:01:24.547Z","repository":{"id":349410442,"uuid":"1201427271","full_name":"syvb/cocoracle","owner":"syvb","description":null,"archived":false,"fork":false,"pushed_at":"2026-04-05T19:14:07.000Z","size":55,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"master","last_synced_at":"2026-04-05T21:15:08.703Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/syvb.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-04-04T17:01:39.000Z","updated_at":"2026-04-05T19:14:16.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/syvb/cocoracle","commit_stats":null,"previous_names":["syvb/cocoracle"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/syvb/cocoracle","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/syvb%2Fcocoracle","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/syvb%2Fcocoracle/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/syvb%2Fcocoracle/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/syvb%2Fcocoracle/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/syvb","download_url":"https://codeload.github.com/syvb/cocoracle/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/syvb%2Fcocoracle/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34127345,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-09T02:00:06.510Z","response_time":63,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2026-06-09T22:01:23.756Z","updated_at":"2026-06-09T22:01:24.542Z","avatar_url":"https://github.com/syvb.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Cocoracle\n\n[![Models on HuggingFace](https://img.shields.io/badge/Models-HuggingFace-yellow)](https://huggingface.co/syvb/cocoracle)\n\n**Can we interpret what a model is \"thinking\" during latent reasoning?**\n\nThis project combines two ideas:\n- [**Coconut**](https://arxiv.org/abs/2412.06769) (Chain of Continuous Thought): a method for training LLMs to reason in latent space, feeding hidden states back as inputs instead of generating chain-of-thought text tokens\n- [**Activation Oracles**](https://arxiv.org/abs/2512.15674) (AOs): models trained to answer natural-language questions about another model's internal activations\n\nWe fine-tune GPT-2 into Coconut models that solve multi-digit addition with latent reasoning steps, then build a \"self-oracle\" (the same model, further fine-tuned to answer questions about its own hidden states) to interpret what the 
latent thoughts encode.\n\n## Key Results\n\n### Coconut model: latent reasoning improves with scale\n\nGPT-2-large (774M) learns to reason in latent space far better than GPT-2-small (124M). Values are answer accuracy at each curriculum stage:\n\n| Stage | Description | GPT-2-small accuracy | GPT-2-large accuracy |\n|-------|-------------|----------------------|----------------------|\n| 0 | Full text CoT | trained (not reported) | trained (not reported) |\n| 1 | Last step latent | 69% | **99%** |\n| 2 | Last 2 steps latent | 19% | **71%** |\n| 3 | All steps latent | 2.4% | **45%** |\n\n### Self-oracle: reading latent thoughts\n\nThe self-oracle approach (fine-tuning the Coconut model to interpret its own activations) gives the first non-zero exact match we observed on recovering chain-of-thought text from latent hidden states, in the all-latent GPT-2-large setting:\n\n| Configuration | CoT Exact Match | CoT Token F1 | AO Val Loss |\n|--------------|----------------|--------------|-------------|\n| Separate AO (GPT-2-small + LoRA) | 0% | 26.4% | 2.92 |\n| Self-oracle, GPT-2-small, stage 1 | 0% | 32.5% | 1.98 |\n| Self-oracle, GPT-2-large, stage 1 | 0% | 25.6% | 1.10 |\n| **Self-oracle, GPT-2-large, all-latent** | **6.9%** | **34.2%** | **0.55** |\n\nThe all-latent GPT-2-large self-oracle recovers exact CoT strings such as `\"2+8=10 write 0 carry 1\"` and `\"4+4=8 write 8\"` from latent hidden states alone 6.9% of the time, with 34.2% token-level F1. A random baseline scores 0% exact match, which suggests the oracle reads real signal from the injected activations rather than guessing.\n\n### Linear probes confirm the information exists\n\nLinear probes on the latent thought hidden states (GPT-2-small, layer 6) achieve:\n- **100%** accuracy classifying which arithmetic step the model is on\n- **100%** accuracy predicting the first token of the reasoning step\n- **42%** exact match on predicting the full answer\n\nStep identity and the first token of each step are perfectly linearly decodable, and the full answer is partially so. The information is largely present in the activations; the harder problem is decoding it into text.\n\n## How It Works\n\n### 1. 
Coconut Model\n\nGPT-2 is fine-tuned on multi-digit addition with a multi-stage curriculum. Each example has a problem, a chain of thought with one step per digit column (working from the least significant digit), and the answer:\n\n```\nProblem: 347 + 285 =\nCoT: 7+5=12 write 2 carry 1 | 4+8+1=13 write 3 carry 1 | 3+2+1=6 write 6\nAnswer: 632\n```\n\nStage 0 trains with the full text CoT. Stages 1-3 progressively replace CoT steps, starting from the last one, with continuous thought vectors: hidden states fed back as inputs instead of being decoded to text. At the \"all-latent\" stage, the model sees only the problem, performs all reasoning internally through a sequence of hidden state vectors, then produces the answer.\n\n### 2. Activation Collection\n\nFrom the trained Coconut model, we extract the hidden state at each latent thought position from the layer at 50% depth (layer 6 of 12 for GPT-2-small, layer 18 of 36 for GPT-2-large). Each hidden state is paired with the ground-truth CoT text it replaced.\n\n### 3. Self-Oracle\n\nThe Coconut model itself is fine-tuned to answer questions about its own activations:\n\n```\nInput:  \"Layer 18: \u003cact\u003e What is the intermediate calculation at this reasoning step?\"\n        (with Coconut's layer-18 hidden state injected at \u003cact\u003e via norm-matched addition)\nTarget: \"7+5=12 write 2 carry 1\u003c|endoftext|\u003e\"\n```\n\nKey design choices:\n- **Self-interpretation**: Using the same model rather than a separate oracle, since it already handles the arithmetic task and produced the representations being read\n- **Layer-matched injection**: Activations extracted from layer N are injected at layer N-1, keeping the signal in the same representational space\n- **2x injection scaling**: Doubling the norm-matched signal strength to make the activation more visible to subsequent layers\n- **Mixed training**: 70% AO tasks + 30% original text CoT, to guard against catastrophic forgetting\n- **EOS tokens**: Training targets end with `\u003c|endoftext|\u003e` so the model learns when to stop\n\n### What made the breakthrough\n\nWe attribute the jump from 0% to 6.9% exact match to three factors acting together:\n\n1. 
**GPT-2-large (774M)** produces a much better Coconut model (45% all-latent accuracy vs 2.4% for GPT-2-small), which plausibly yields richer latent representations\n2. **All-latent mode** gives the AO several thought-step activations per problem instead of just one, providing more training signal\n3. **Self-oracle approach** leverages the model's existing knowledge of arithmetic and of its own internal representations\n\n## Project Structure\n\n```\nsrc/\n  data_gen.py          # Synthetic arithmetic dataset with carry-propagation CoT\n  coconut_model.py     # GPT-2-small with continuous thought support\n  activation_oracle.py # Separate-model AO with LoRA (baseline)\n  self_oracle.py       # Self-oracle: Coconut model interprets its own activations\n\nscripts/\n  01_generate_data.py          # Generate 100K addition problems with CoT\n  02_train_coconut.py          # 4-stage curriculum (GPT-2-small)\n  03_collect_activations.py    # Extract hidden states from latent thought positions\n  04_train_oracle.py           # Train separate AO (baseline)\n  05_train_probes.py           # Train linear probe baselines\n  06_evaluate.py               # Full evaluation suite\n  07_train_self_oracle.py      # Train self-oracle (GPT-2-small)\n  08_eval_self_oracle.py       # Evaluate self-oracle\n  09_gpt2large_experiment.py   # GPT-2-large, stage 1 only\n  10_gpt2large_alllatent.py    # GPT-2-large, all stages through all-latent (best results)\n  run_all.sh                   # End-to-end pipeline (scripts 01-06)\n```\n\n## Reproduction\n\n```bash\npip install -r requirements.txt\n\n# GPT-2-small base experiment (~3 hours on RTX 4090)\nbash scripts/run_all.sh\n\n# GPT-2-small self-oracle (~1.5 hours)\npython scripts/07_train_self_oracle.py\npython scripts/08_eval_self_oracle.py\n\n# GPT-2-large stage 1 (~8 hours)\npython scripts/09_gpt2large_experiment.py\n\n# GPT-2-large all-latent — best results (~16 hours)\npython scripts/10_gpt2large_alllatent.py\n```\n\nRequires a GPU with \u003e= 16GB VRAM. 
Tested on NVIDIA RTX 4090 (24GB).\n\n## Pre-trained Models\n\nCheckpoints are available on HuggingFace: **[syvb/cocoracle](https://huggingface.co/syvb/cocoracle)**\n\n- `stage3_alllatent.pt`: GPT-2-large Coconut model, all-latent (45% accuracy)\n- `self_oracle_alllatent.pt`: GPT-2-large self-oracle (6.9% CoT exact match)\n\n## What Would Improve This Further\n\n1. **More training**: AO validation loss was still decreasing at epoch 5 (0.55 and dropping), so training for 20+ epochs may push exact match higher.\n2. **Larger models**: The AO paper reports success at 8B+ scale. A Qwen-8B or Llama-8B Coconut model with self-oracle training could produce substantially better results.\n3. **Better Coconut training**: Our all-latent model reaches only 45% accuracy. The original Coconut paper uses more data and longer training, and a 90%+ all-latent model would likely have much richer latent states.\n4. **Multi-layer injection**: We currently inject at a single layer. Injecting the same activation at multiple layers could strengthen the signal.\n5. **More diverse training tasks**: The AO paper uses classification, context prediction, and system prompt QA; we only use arithmetic. More diverse tasks could improve generalization.\n\n## References\n\n- Hao et al., \"Training Large Language Models to Reason in a Continuous Latent Space\" (2024). [arXiv:2412.06769](https://arxiv.org/abs/2412.06769)\n- Karvonen et al., \"Activation Oracles: Training and Evaluating LLMs as General-Purpose Activation Explainers\" (2026). [arXiv:2512.15674](https://arxiv.org/abs/2512.15674)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsyvb%2Fcocoracle","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsyvb%2Fcocoracle","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsyvb%2Fcocoracle/lists"}