{"id":35015638,"url":"https://github.com/mdda/getting-to-aha-with-tpus","last_synced_at":"2025-12-27T05:19:29.070Z","repository":{"id":278555064,"uuid":"926917930","full_name":"mdda/getting-to-aha-with-tpus","owner":"mdda","description":"Reasoning-from-Zero using gemma.JAX.nnx on TPUs","archived":false,"fork":false,"pushed_at":"2025-12-17T08:25:35.000Z","size":197,"stargazers_count":9,"open_issues_count":0,"forks_count":0,"subscribers_count":3,"default_branch":"main","last_synced_at":"2025-12-20T21:46:21.321Z","etag":null,"topics":["gemma","jax","nnx","reasoning-language-models","tpu"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mdda.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-02-04T04:40:35.000Z","updated_at":"2025-12-17T08:25:39.000Z","dependencies_parsed_at":"2025-03-09T20:18:40.217Z","dependency_job_id":"053bff16-ac0a-43f5-ab58-9b6d455b4520","html_url":"https://github.com/mdda/getting-to-aha-with-tpus","commit_stats":null,"previous_names":["mdda/getting-to-aha-with-tpus"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/mdda/getting-to-aha-with-tpus","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mdda%2Fgetting-to-aha-with-tpus","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mdda%2Fgetting-to-aha-with-tpus/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositorie
s/mdda%2Fgetting-to-aha-with-tpus/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mdda%2Fgetting-to-aha-with-tpus/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/mdda","download_url":"https://codeload.github.com/mdda/getting-to-aha-with-tpus/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mdda%2Fgetting-to-aha-with-tpus/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28072870,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-12-27T02:00:05.897Z","response_time":58,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["gemma","jax","nnx","reasoning-language-models","tpu"],"created_at":"2025-12-27T05:19:27.971Z","updated_at":"2025-12-27T05:19:29.059Z","avatar_url":"https://github.com/mdda.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Getting to Aha\n## With TPU(s) \u003cstrike\u003eusing JAX nnx\u003c/strike\u003e\n\n* Reasoning-from-Zero using TPUs for compute\n  + Following the release of DeepSeek's R1 model, there was a nice follow-up from a group at Berkeley with a 'Countdown task reasoner' that can be trained from scratch for \"$30 of H100s\" (https://github.com/Jiayi-Pan/TinyZero)\n  + The aim of this project is to replicate that same task, but using a gemma2 
model, and TPU infrastructure\n  + This will make it far, far more likely that TPUs could become an experimentation platform for the curious : The current barriers to entry are very high\n\n\n## The Plan\n\n* Use `gemma2-2B-base` on:\n  + Kaggle TPU v3-8; and \n  + Colab TPU v2-8 (potentially - it would be very tight)\n* Reasoning task : Countdown task from TinyZero\n  + RL objective : GRPO\n* Goal : Get to \"Aha!\" using \\$free TPU resources\n  + with codebase that is:\n    * Plain and Readable (i.e. not simply dressing up a call to `trl`)\n    * Hackable (i.e. can implement more than the demo case)\n\n\n### Decision : Which framework?\n\n* [JAX `flax.nnx` examples/gemma](https://github.com/google/flax/tree/main/examples/gemma) (i.e. *new* style)\n  + Positives: \n    - Framework being promoted as PyTorch-user-friendly\n  + Negatives:\n    - Early days (PROVEN)\n    - `gemma` example in [`nnx` documentation](https://flax.readthedocs.io/en/latest/guides/gemma.html) does not work\n      * [PR submitted to fix glaring error(s)](https://github.com/google/flax/pull/4587)\n    - `nnx.jit` of Transformer forward pass proven to take \u0026gt;60Gb RAM during compilation\n      * (it would only not crash the VM if the instance had \u0026lt;70Gb available RAM)\n      * Therefore impractical for use on Colab/Kaggle == DEAD END\n* [Google-DeepMind `gemma` library]() in JAX `flax.linen` (i.e. 
*old* style)\n  + Positives:\n    - The library actually works with Gemma2\n      * And consumes \u0026lt;1Gb RAM doing `jit` on forward pass / sampling\n    - Library has LoRA and sharding\n  + Negatives:\n    - Flax/linen is (according to the `nnx` docs) backward-looking\n    - Heavy dependency on `kauldron` for training (and\n      [LoRA](https://github.com/google-deepmind/gemma/blob/main/examples/lora.py#L53),\n      [sharding](https://github.com/google-deepmind/gemma/blob/main/examples/sharding.py#L44), etc)\n      * Undermines the goal of using plain, readable code \n    - GDM `gemma` library transformer [Sampler is greedy-only](https://github.com/google-deepmind/gemma/blob/main/gemma/sampler.py#L145) \n      * Monkey-patching this functionality (which is deep inside the `Sampler` class) would smell bad\n      * So adding library features would have to be done before beginning\n* [`pytorch-gemma`](https://github.com/google/gemma_pytorch/) library for PyTorch/XLA\n  + Positives:\n    - Library appears ready for CPU, GPU and TPU\n    - Includes distribution code (with Column-wise and Row-wise Linear implementations)\n    - Includes 8-bit quantised code\n  + Negatives:\n    - Does not appear to include LoRA\n      * Though may be compatible with PEFT (needs testing)\n      * How does auto-LoRA interact with sharding?  
Eeek\n    - While PyTorch XLA is clearly ['real'](https://github.com/google/gemma_pytorch/blob/main/scripts/run_xla.py#L33) ...\n      * Need to test whether XLA code can get 'compiled' in a similar way to JAX `jit`  \n* [Keras gemma implementation](https://keras.io/keras_hub/api/models/gemma/gemma_causal_lm/) using JAX backend\n  + Positives:\n    - Ecosystem appears ready for CPU, GPU and [TPU](https://www.kaggle.com/code/matthewdwatson/gemma-2-tpu-fine-tuning)\n    - Includes LoRA, more sophisticated sampling and distribution over TPUs\n    - Actually *proven to work* on TPUs via Colab (in this Repo)\n  + Negatives:\n    - IMHO, Keras is perceived as being somewhat *lame* vs other frameworks\n    - Still need to test whether fancy sampling, fancy distribution strategy, and custom training step (GRPO)\n      can be implemented *at the same time*\n\nSo far: \n* `nnx` has succeeded in:\n  + causing me to laboriously debug and fix the example library \n  + wasting many GPU hours frustratedly trying to `nnx.jit` things without crashing the VM\n* `gemma` (GDM library) \n  + only has a greedy Sampler - which would need fixing\n  + relies very heavily on `kauldron` to do fancy things\n* PyTorch/XLA `pytorch_gemma` looks interesting, though would need:\n  + LoRA to be added (ideally using PEFT)\n  + actual benchmarking on TPUs vs JAX (time-consuming)\n* Keras.JAX seems likely to be a good basis,\n  + and has *started to show signs of life*\n  + though it remains to be seen whether it works as advertised as the model/RL gets more complex\n\n--- \n\n## Installation / Running the code\n\n```bash\nsudo snap install astral-uv --classic\nuv venv env_flax.nnx\n. 
./env_flax.nnx/bin/activate\nuv pip install jupyterlab ipywidgets jupytext OmegaConf\n```\n\n* Run jupyterlab notebook environment:\n```bash\njupytext --set-formats cache-notebooks//ipynb,py:light *.py\n#...\njupyter-lab --port 8282 --no-browser\n```\n\n* Test the countdown puzzle generation:\n```bash\npushd ./aha_dataset/countdown/\npython generator.py \npopd\n```\n\n---\n\n## RL-related Resources\n\n### Post-R1 GRPO demos\n\n* [Experience the Ahah moment yourself for \u0026lt;\\$30](https://github.com/Jiayi-Pan/TinyZero)\n  + Berkeley : Jiayi Pan=@jiayi_pirate, @JunjieZhang12, @xingyaow_, @lifan__yuan\n  + [Author Twitter thread](https://x.com/jiayi_pirate/status/1882839370505621655)\n  + TinyZero is a reproduction of DeepSeek R1 Zero in countdown and multiplication tasks\n    = We built upon `veRL`\n  + CountDown: a game where players combine numbers with basic arithmetic to reach a target number\n    + [Scoring Function](https://github.com/Jiayi-Pan/TinyZero/blob/main/verl/utils/reward_score/countdown.py#L59), \n    [Dataset with correct answers](https://huggingface.co/datasets/Jiayi-Pan/Countdown-Tasks-3to4) loaded [here](https://github.com/Jiayi-Pan/TinyZero/blob/main/examples/data_preprocess/countdown.py)\n    + This is a bit strange, since the rows of the dataset do not include `{+,-,*,/}` answers\n      - ... 
presumably there is a way to get the `target` from the `nums` ...\n      - Maybe: [generated by exhaustive search](https://github.com/kanishkg/stream-of-search/blob/main/src/countdown_generate.py)\n  + Tried : Qwen-2.5-Base 0.5B, 1.5B, 3B, 7B\n    - \u0026gt;1.5B models start learning to search, to self-verify and to revise solutions\n  + Either base or instruct model works \n    - Converge to same performance (instruct learns more quickly)\n    - Instruct model's output are more structured and readable\n  + PPO, GRPO and PRIME all worked\n* [Mini-R1: Reproduce Deepseek R1 \"aha moment\" - an RL tutorial](https://www.philschmid.de/mini-deepseek-r1)\n  + = same as above\n  + \"This blog is inspired by Jiayi Pan who initially explored the idea and proofed it with a small model.\"\n\n\n* [willccbb/grpo_demo.py](https://gist.github.com/willccbb/4676755236bb08cab5f4e54a0475d6fb)\n  + Will Brown = @willccbb\n  + Code gist, with comments from implementers / testers\n    - Llama 1B, GSM8k\n    - `from peft import LoraConfig`\n    - `from trl import GRPOConfig, GRPOTrainer`\n    - beta: (float, optional, defaults to 0.04) — KL coefficient\n      + Commenter had success with beta=0.01\n      + https://huggingface.co/docs/trl/main/en/grpo_trainer#trl.GRPOConfig.beta\n  + (updated code, running smooth now on Qwen-1.5B w/ longer max_completion_length + higher num_generations)\n  + \"TRL GRPO has vLLM now btw + it's soooo much faster wow\"\n  + [Next version (?) 
uses TRL_GRPOTrainer](https://x.com/willccbb/status/1886243810323148890)\n  + [Colab version with Qwen 0.5B](https://colab.research.google.com/drive/1bfhs1FMLW3FGa8ydvkOZyBNxLYOu0Hev?usp=sharing)\n    - Runs `vLLM` on Colab GPU too\n    - Decent looking code, but Aha is not directly visible...\n  + This used by [@anton](https://x.com/abacaj/status/1884361852349825444)\n    - [Qwen2.5-0.5B (base model) directly goes into step by step breakdown with zero prompting](https://x.com/abacaj/status/1888826994248323354) (and Llama doesn't produce step-wise thinking of its own accord)\n    - [when reward starts going up at step \u0026gt;100 it's either hacking it or discovered something](https://x.com/abacaj/status/1889739663855743472)\n    - See below...\n\n* @anton experiments\n  + [\"Perfect Reward Function\"](https://x.com/abacaj/status/1884787697408999619)\n  + [\"Finished a run (R1 style) GRPO on Qwen-2.5-0.5B (base model) yield +10 accuracy points on GSM8K. Literally just works\"](https://x.com/abacaj/status/1885517088304857197)\n    - Two tricks I found work well : \n      * use a good system prompt, and \n      * try lower beta (KL coefficient). 
\n    - 3 rewards: int reward, final_answer tags, and correctness reward\n    - has commented on original `willccbb/grpo_demo.py` gist\n      + Has own gist of [GRPOTrainer to run gsm8k eval during training](https://gist.github.com/abacaj/9a567910c1a8663f7aa04520075e0ba8)\n  + [\"Got a better result on qwen2.5-0.5b (base) \u0026rarr; 56% gsm8k\"](https://x.com/abacaj/status/1886308242814320834)\n\n\n* [Full GRPO fine-tuning Qwen2.5 0.5B on a single T4](https://gist.github.com/qunash/820c86d1d267ec8051d9f68b4f4bb656)\n  + @qunash on GitHub = https://huggingface.co/anzorq \n  + Fork of the TRL repo by [GitHub @andyl98](https://github.com/andyl98) - with more optimisations\n  + `Qwen2.5-0.5B-Instruct` gsm8k eval result from 22.4% to 48.6% \n    - in just ~150 steps (~30 minutes) on a single T4 GPU\n\n\n* [Train your own R1 reasoning model with Unsloth](https://unsloth.ai/blog/r1-reasoning)\n  + [Daniel Han (unsloth) thread](https://x.com/danielhanchen/status/1887564724071768529)\n    - We removed double memory usage during vLLM serving and finetuning\n    - 70% less VRAM finetuning and 20x faster inference all in one package! \n    - LoRA / QLoRA also originally *did not work* for people when doing GRPO in the starter script\n  + [unsloth thread](https://x.com/UnslothAI/status/1887562753126408210)\n  + [GRPO with unsloth on free colab](https://colab.research.google.com/drive/1P7frB3fjMv6vjSINqiydAf6gnMab2TiL?usp=sharing)\n    - \"it's painfully slow; but works :p\"\n    - Exposes code from TRL training loop a little...\n    - `model=\"Qwen/Qwen2-0.5B-Instruct\", reward_funcs=\"weqweasdas/RM-Gemma-2B\",` ... 
reward model?\n  + [Commentary](https://x.com/Hesamation/status/1888285721863004411)\n    - GRPO is now optimized to use 80% less VRAM\n    - GRPO now with LoRA and QLoRA\n    - Qwen2.5(1.5B) can be trained with just 7GB!\n    - Llama3.1(8B) training with 15GB\n\n\n* SimpleRL : [7B Model and 8K MATH Examples: Emerging Reasoning with Reinforcement Learning is Both Effective and Efficient](https://hkust-nlp.notion.site/simplerl-reason)\n  + Weihao Zeng, Yuzhen Huang, Wei Liu, Keqing He, Qian Liu, Zejun Ma, Junxian He=@junxian_he\n : hkust\n    \u003e We reproduce the training of DeepSeek-R1-Zero and DeepSeek-R1 for complex mathematical reasoning, \n    \u003e starting from Qwen-2.5-Math-7B (base model), \n    \u003e and only using 8K (query, final answer) examples from the original MATH dataset. \n  + [Code on GitHub](https://github.com/hkust-nlp/simpleRL-reason)\n    - We are working on the paper and will release it very soon\n    - Uses OpenRLHF\n\n* [\"R1-V: Reinforcing Super Generalization Ability in Vision Language Models\"](https://x.com/liangchen5518/status/1886171667522842856)\n  + Liang Chen = @liangchen5518\n  + https://github.com/Deep-Agent/R1-V  \n    - Cost\u0026lt;\\$3 : 8 A100 GPUs for 30 minutes\n    - 100 training steps\n\n* The Thought Process Behind Kimi k1.5 \n  + [Kimi k1.5: Scaling Reinforcement Learning with LLMs](https://arxiv.org/abs/2501.12599)\n  + [Informative Author Thread](https://x.com/Kimi_Moonshot/status/1882413059513471044)\n\n* [s1: Simple test-time scaling](https://arxiv.org/abs/2501.19393)\n  + Add 'Wait!' when model wants to do '\u0026lt;/think\u0026gt;' to extend thought process\n  + SFT on thought traces from ...?\n  + [s1: The \\$6 R1 Competitor?](https://timkellogg.me/blog/2025/02/03/s1)\n    - [Entropix Tie In](https://timkellogg.me/blog/2024/10/10/entropix) - in entropix, extra 'encouragement' tokens were added in... 
So: similar idea\n  + [repo on GitHub](https://github.com/simplescaling/s1)\n  + [Project Page](https://simplescaling.github.io/)\n  + Frugality:\n    - Sifted their dataset of 56K examples down to just the best 1K, \n    - the core 1K is all that's needed to achieve o1-preview performance on a 32B model.\n    - Adding data didn't raise performance at all.\n  + [s1.1 : trained on same 1K questions](https://x.com/Muennighoff/status/1889310803746246694)\n    - DeepSeek answers, rather than Gemini generations\n    - As it is just 1K examples, training is extremely cheap and took just 26 minutes\n    - To control test-time compute, we develop “budget forcing”:\n      * We either force the model to end its thinking or \n      * extend it by appending Wait when the model tries to stop\n      * This simple method improves our model\n  + GDE Blogpost : [s1 and s1.1](https://gonzoml.substack.com/p/s1-simple-test-time-scaling)\n\n\n#### Contrarian Ideas\n\n* [There May Not be Aha Moment in R1-Zero-like Training — A Pilot Study](https://oatllm.notion.site/oat-zero)\n  + SEA AI Labs in SG (!)\n  + [OAT-Zero code on GitHub](https://github.com/sail-sg/oat-zero)\n  + Key points:\n    - \"We found Aha moment (such as self-reflection patterns) appears at epoch 0, namely base models\"\n    - \"Superficial Self-Reflection (SSR) from base models' responses\" - leading to wrong answer\n    - \"increasing response length phenomenon not emergence .. 
but RL optimizing rule-based reward\"\n  + [OAT RL library](https://github.com/sail-sg/oat) - A research-friendly framework for LLM online alignment\n\n\n\n### GRPO expositions\n\n* [GRPO with Verifiable (Binary) Rewards Is an Adaptive Weighted Contrastive Loss](https://ymroueh.me/post/post_1/)\n  + IBM researcher : Breaks down whitening into the factors\n* [Nathan Lambert Book on Blog](https://rlhfbook.com/c/11-policy-gradients.html#group-relative-policy-optimization-1)\n\n* [Fine-tuning 20B LLMs with RLHF on a 24GB consumer GPU](https://huggingface.co/blog/trl-peft)\n  + Includes PEFT and `trl` (9-March-2023)\n\n* [GRPO also works very well for Llama 2 7B, with an impressive +15 accuracy point increase in GSM8K](https://x.com/rdolmedo_/status/1886505669622149139)\n  + \"There's nothing magical about recent model families. If the model can perform some task with sufficient accuracy, then RL with verifiable rewards will likely boost performance\"\n  + [Run it yourself: `RicardoDominguez/grpo_llama2-7b-chat_gsm8k.sh`](https://gist.github.com/RicardoDominguez/72603d278ed26f0dd55af6ffd414b797)\n    - Seems like unrolled code from TRL ... 
everything is there\n\n\n### GRPO Hints\n \n* [DeepSeek R1 training is straight-forward, UNTIL you understand the complexities in writing GRPO Verifiers](https://x.com/bookwormengr/status/1888530568645861865)\n  + Somewhat ranty\n\n* Trellis video series:\n  + 1: [Reinforcement Learning for LLMs in 2025](https://www.youtube.com/watch?v=C4HxJQ2QzWo)\n    - Set-up of training, with curation of SFT data (mostly)\n  + 2: [How does GRPO work?](https://www.youtube.com/watch?v=iHlarYGLMbY)\n    - 32mins : TODO:WATCH!\n\n* [GRPO implementation update](https://github.com/allenai/open-instruct/issues/534#issuecomment-2634656168)\n  + Fixing up the implementation in AllenAI RL library\n  + [Other comments](https://x.com/vwxyzjn/status/1885329398821187633):\n    - When directly minimizing the KL loss, kl3 just appears much more numerically stable. \n    - And the \u0026gt;0 guarantee here is also really nice (kl1 could go negative).\n  + [John Schulman's Homepage : Approximating KL Divergence](http://joschu.net/blog/kl-approx.html)\n  + BUT ... [LMs with GRPO etc with KL penalty = 0 works](https://x.com/natolambert/status/1890071898869907646)\n    - \"These are from experiments and this is not official training advice.\"\n\n\n### GRPO libraries\n\n* [GRPO from DeepSeek-R1 is now available in Hugging Face `trl` library](https://x.com/Hesamation/status/1882001485636178414)\n  + [`GRPOTrainer` Docs](https://huggingface.co/docs/trl/main/en/grpo_trainer)\n  + KL divergence is *estimated* using the approximator introduced by Schulman et al. 
(2020)\n    - The approximator is defined as follows:  `p_ratio - log(p_ratio) - 1`\n  + Has a `use_vllm=True` parameter to do generations using `vllm`\n  + [\"just a reminder : trl grpo is not same as same as described in deepseek paper\"](https://x.com/shxf0072/status/1886390053104242983)\n    - No clipping objective (though does have KL term) (may not be important at all)\n      + Maybe [KL term not needed with verifiable rewards](https://x.com/shxf0072/status/1892687698261139566)\n    - Also \"Joey (e/λ)\" has [comments about gradient / loss and removing constants](https://x.com/shxf0072/status/1892668791303373042)...  \n      + Claim : \"loss =  advantage*log_softmax(logits) works, same gradients\"\n      + (Makes sense at first glance, but not clear whether there's something else going on)\n* [OpenRLHF](https://github.com/OpenRLHF/OpenRLHF)\n  + High-performance RLHF framework built on Ray, DeepSpeed and HF Transformers\n* [veRL](https://github.com/volcengine/verl)\n  + Volcano Engine Reinforcement Learning for LLM\n\n\n### R1 Notes\n\n* https://x.com/Guodaya/status/1886635010251518330 (now deleted)\n  + =Researcher at DeepSeek \n  + The 660B R1-Zero and R1 began running after the release of V3, with training taking approximately 2-3 weeks\n  + The R1 model prior to this time (e.g., in the V3 tech report) was the R1-Lite or the R1-Lite-Zero\n\n\n### Miscellaneous\n\n* [GRPO VRAM Requirements For the GPU Poor](https://www.oxen.ai/blog/grpo-vram-requirements-for-the-gpu-poor)\n  + Points out RAM requirements (with potential torch ideas)\n  + GRPO explanation not very useful\n\n* [The N Implementation Details of RLHF with PPO](https://huggingface.co/blog/the_n_implementation_details_of_rlhf_with_ppo)\n  + 2023-10-24\n\n* [The 37 Implementation Details of Proximal Policy Optimization](https://iclr-blog-track.github.io/2022/03/25/ppo-implementation-details/)\n  + 2022-03-25\n\n\n---\n\n## Potential next ideas\n\n### RL on Deepseek 'hard distilled' models\n\n* 
[DeepScaleR: Surpassing O1-Preview with a 1.5B Model by Scaling RL](https://pretty-radio-b75.notion.site/DeepScaleR-Surpassing-O1-Preview-with-a-1-5B-Model-by-Scaling-RL-19681902c1468005bed8ca303013a4e2)\n  + Berkeley Sky Computing Lab (not the same authors as original \\$30 one, AFAICT)\n  + \"1.5B model beats o1-preview on math by RL\"\n  + Cost:\n    - Overall, our training run consists of ~1,750 steps. \n    - The initial 8K context phase was trained on 8 A100 GPUs, \n    - while the 16K and 24K phases scaled up training to 32 A100 GPUs. \n    - In total, the training took ~3,800 A100 hours = roughly 5 days on 32 A100s\n    - \\$4500 in compute cost\n  + Reddit discussion [DeepScaleR-1.5B-Preview](https://www.reddit.com/r/LocalLLaMA/comments/1imm4wc/deepscaler15bpreview_further_training/)\n  + [Model on HF](https://huggingface.co/agentica-org/DeepScaleR-1.5B-Preview)\n  + [Project on GitHub](https://github.com/agentica-project/deepscaler) \n    - uses their own [veRL](https://github.com/agentica-project/verl)\n  + [\"DeepScaleR is by far the most sophisticated and impressive thing built on R1 this far\"](https://x.com/teortaxesTex/status/1889914611555865007)\n    - Maximizing intelligence per FLOP is a natural step after test time unlock\n\n\n\n### Agentic RAG\n\n* [Agentic RAG systems and taxonomy](https://x.com/tom_doerr/status/1889905154465448265)\n  + [Actual repo](https://github.com/asinghcsu/AgenticRAG-Survey)\n\n* [Eliciting Critical Reasoning in Retrieval-Augmented Language Models via Contrastive Explanations](https://arxiv.org/abs/2410.22874)\n* [Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection](https://arxiv.org/abs/2310.11511)\n* [Chain-of-Retrieval Augmented Generation](https://arxiv.org/abs/2501.14342)\n  + Microsoft\n  + More than 10 points improvement in EM score compared to strong baseline\n  + Establishes a new SotA performance across a diverse range of knowledge-intensive tasks\n\n\n#### Datasets\n\n* [BERGEN: A 
Benchmarking Library for Retrieval-Augmented Generation](https://arxiv.org/abs/2407.01102)\n  + [No results yet?](https://paperswithcode.com/paper/bergen-a-benchmarking-library-for-retrieval)\n* [Benchmarking Large Language Models in Retrieval-Augmented Generation](https://arxiv.org/abs/2309.01431)\n* [LegalBench-RAG: A Benchmark for Retrieval-Augmented Generation in the Legal Domain](https://arxiv.org/abs/2408.10343)\n  + [LegalBench-RAG](https://github.com/ZeroEntropy-AI/legalbenchrag)\n* [Fact, Fetch, and Reason: A Unified Evaluation of Retrieval-Augmented Generation](https://arxiv.org/abs/2409.12941)\n  + [FRAMES: Factuality, Retrieval, And reasoning MEasurement Set](https://huggingface.co/datasets/google/frames-benchmark)\n* [CRAG: Comprehensive RAG Benchmark](https://github.com/facebookresearch/CRAG)\n  + [KDD Task - with starter kit](https://gitlab.aicrowd.com/aicrowd/challenges/meta-comprehensive-rag-benchmark-kdd-cup-2024)\n* [Natural Questions: A Benchmark for Question Answering Research](https://aclanthology.org/Q19-1026/)\n  + [Google : NaturalQuestions](https://ai.google.com/research/NaturalQuestions/dataset)\n  + [Natural Questions SoTA](https://paperswithcode.com/sota/question-answering-on-natural-questions)\n\n\n### Agent RL\n\n* [Grounding Large Language Models in Interactive Environments with Online Reinforcement Learning](https://arxiv.org/abs/2302.02662)\n  + T5 (in 2023-02)\n* [RAGEN: A General-Purpose Reasoning Agent Training Framework](https://github.com/ZihanWang314/ragen)\n  + Code on GitHub\n  + [Author Thread](https://x.com/wzihanw/status/1884092805598826609)\n  + We run RAGEN on the Gym-Sokoban task: \n    - Qwen-2.5-{0.5B, 3B}-{Instruct, None}\n    - DeepSeek-R1-Distill-Qwen-1.5B\n* [Scaled Cognition: \"first ever models trained specifically for agentic applications\"](https://x.com/ScaledCognition/status/1889721166421479751)\n  - \"APT-1, is now #1 on agentic benchmarks\" ...\n\n\n### Task Vectors\n\n* 
https://x.com/chrisbarber/status/1885047105741611507\n  + Shannon Sands (@max_paperclips) from @NousResearch\n  + backtracking vector \n    - \"caused the chain of thought to backtrack much more often, \n      and when suppressed caused it to be a linear and much shorter CoT\"\n\n\u003c!--\n### Cryptic Crosswords\n!--\u003e\n\n---\n\n## TPU Resources\n\nSee [this page](TPU.md) for:\n* JAX Resources\n  + JAX (generic)\n  + Gemma Models\n  + Keras (JAX backend)\n  + LoRA for JAX\n* TPU training\n  + TPU training (Node-style TPUs = old, including Colab)\n  + TPU training (VM-style TPUs = modern)\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmdda%2Fgetting-to-aha-with-tpus","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmdda%2Fgetting-to-aha-with-tpus","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmdda%2Fgetting-to-aha-with-tpus/lists"}