{"id":31688382,"url":"https://github.com/aicomputing101/reinforcement-learning-101","last_synced_at":"2026-04-15T16:03:38.168Z","repository":{"id":318094986,"uuid":"1050752584","full_name":"AIComputing101/reinforcement-learning-101","owner":"AIComputing101","description":"An opinionated, end‑to‑end tutorial project for learning Reinforcement Learning (RL) from first principles to deployment. No notebooks. Everything is an explicit, inspectable Python script you can diff, profile, containerize, and ship.","archived":false,"fork":false,"pushed_at":"2025-10-05T03:06:44.000Z","size":212,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-10-05T04:23:48.949Z","etag":null,"topics":["deep-reinforcement-learning","distributed-computing","docker-container","gpu-programming","q-learning","reinforcement-learning","reinforcement-learning-agent","reinforcement-learning-algorithms","rlhf"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/AIComputing101.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":".github/FUNDING.yml","license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null},"funding":{"github":"coketaste","patreon":null,"open_collective":null,"ko_fi":null,"tidelift":null,"community_bridge":null,"liberapay":null,"issuehunt":null,"lfx_crowdfunding":null,"polar":null,"buy_me_a_coffee":null,"thanks_dev":null,"custom":null}},"created_at":"2025-09-04T22:10:24.000Z","updated_
at":"2025-10-05T03:06:47.000Z","dependencies_parsed_at":"2025-10-05T04:23:57.686Z","dependency_job_id":"274f16a0-cb92-431f-9bff-34124525930d","html_url":"https://github.com/AIComputing101/reinforcement-learning-101","commit_stats":null,"previous_names":["aicomputing101/reinforcement-learning-101"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/AIComputing101/reinforcement-learning-101","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AIComputing101%2Freinforcement-learning-101","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AIComputing101%2Freinforcement-learning-101/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AIComputing101%2Freinforcement-learning-101/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AIComputing101%2Freinforcement-learning-101/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/AIComputing101","download_url":"https://codeload.github.com/AIComputing101/reinforcement-learning-101/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AIComputing101%2Freinforcement-learning-101/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":278931662,"owners_count":26070788,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-08T02:00:06.501Z","response_time":56,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositor
y_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["deep-reinforcement-learning","distributed-computing","docker-container","gpu-programming","q-learning","reinforcement-learning","reinforcement-learning-agent","reinforcement-learning-algorithms","rlhf"],"created_at":"2025-10-08T10:55:37.099Z","updated_at":"2025-10-08T10:55:42.951Z","avatar_url":"https://github.com/AIComputing101.png","language":"Python","funding_links":["https://github.com/sponsors/coketaste"],"categories":[],"sub_categories":[],"readme":"# Reinforcement Learning 101\n\n**Progressive, Hands‑On Reinforcement Learning Project (CLI-First)**\nBuilt for clarity, reproducibility, and production awareness.\n\n![python](https://img.shields.io/badge/python-3.11%20|%203.12-blue)\n![license](https://img.shields.io/badge/license-Apache--2.0-green)\n![logging](https://img.shields.io/badge/logging-rich%20console-purple)\n![status](https://img.shields.io/badge/status-active-success)\n\nAn opinionated, end‑to‑end tutorial project for learning Reinforcement Learning (RL) from first principles to deployment. **No notebooks.** Everything is an explicit, inspectable Python script you can diff, profile, containerize, and ship.\n\n📊 **[View Visual Diagrams](docs/diagrams/)** | 🎨 **[Algorithm Taxonomy](docs/diagrams/algorithm_taxonomy.png)** | 🗺️ **[Learning Path](docs/diagrams/learning_path.png)**\n\n---\n\n## Table of Contents\n1. [Who Is This For?](#1-who-is-this-for)\n2. [Learning Outcomes](#2-learning-outcomes)\n3. [Quick Start](#3-quick-start)\n4. [Prerequisites](#4-prerequisites)\n5. [Module Path](#5-module-path-progressive-difficulty)\n   - [5.1 Algorithm Selection Guide](#51-algorithm-selection-guide---which-algorithm-should-i-use) ⭐ \n6. [Project Layout](#6-project-layout)\n7. [Running Examples](#7-running-examples-more-highlights)\n8. [Environment \u0026 Reproducibility](#8-environment--reproducibility)\n9. [Best Practices](#9-best-practices)\n10. 
[GPU \u0026 Docker](#10-gpu--docker)\n11. [Dependencies](#11-dependencies)\n12. [Testing \u0026 Fast Validation](#12-testing--fast-validation)\n13. [Troubleshooting](#13-troubleshooting)\n14. [Extending the Project](#14-extending-the-project)\n15. [Contributing](#15-contributing)\n16. [Roadmap](#16-roadmap)\n17. [References](#17-references)\n18. [License](#18-license)\n19. [Citation (Optional)](#19-citation-optional)\n20. [FAQ](#20-faq)\n\n---\n\n## 1. Who Is This For?\nLearners who want a structured, hands-on path:\n* You know basic Python \u0026 NumPy, maybe a little PyTorch.\n* You want to understand RL algorithms by reading and running minimal reference implementations.\n* You prefer reproducible scripts over exploratory notebooks.\n* You eventually want to operationalize RL (serving, batch/offline, containers, Kubernetes).\n\nIf you just want a black‑box library, this project intentionally is not that. It shows the scaffolding explicitly.\n\n## 2. Learning Outcomes\nBy completing all 7 modules you will be able to:\n* Implement and compare exploration strategies in multi‑armed bandits.\n* Derive and code tabular Q-Learning; extend to deep value methods (DQN family, Rainbow components).\n* Train policy gradient and REINFORCE baselines; reason about variance \u0026 baselines.\n* Build actor‑critic agents (A2C, **PPO**, **TD3**, SAC, TRPO) and understand stability trade‑offs.\n* Master **industry-standard algorithms** used in production (ChatGPT's RLHF uses PPO).\n* **Apply cutting-edge algorithms (2024-2025)**: Offline RL (CQL, IQL), Model-Based (Dreamer), RLHF for LLMs.\n* Experiment with advanced ideas (evolutionary strategies, curiosity, multi-agent coordination).\n* Apply RL framing to industry‑style scenarios (bidding, recommendation, energy control).\n* Package, serve, and batch‑evaluate trained agents (TorchServe, Ray RLlib, Kubernetes jobs).\n* **Optimize training infrastructure**: GPU acceleration, distributed training, hyperparameter tuning.\n\n## 3. 
Quick Start\n\n### Automated Setup (Recommended)\nOne command with GPU auto-detection:\n```bash\n./setup.sh\n```\n\nOr choose your backend directly:\n```bash\n./setup.sh native          # Native Python (auto-detect GPU)\n./setup.sh docker cuda     # Docker with NVIDIA GPU\n./setup.sh native cpu      # CPU-only native setup\n```\n\n### Manual Setup (CPU)\n```bash\npython3 -m venv .venv\nsource .venv/bin/activate  # Windows: .venv\\Scripts\\activate\npip install -r requirements/requirements-base.txt\npip install -r requirements/requirements-torch-cpu.txt\n```\n\n**Verify installation:**\n```bash\n# Run all tests (comprehensive)\npython scripts/smoke_test.py\n\n# Quick test - core examples only (no PyTorch)\npython scripts/smoke_test.py --core-only\n\n# Skip optional/slow tests\npython scripts/smoke_test.py --skip-optional\n\n# Test specific group\npython scripts/smoke_test.py --group deep-rl\n```\n\nThe runner now inspects your environment up front: it reports detected versions of PyTorch, Ray, and Optuna, and will automatically skip optional checks that require missing extras (for example, `ppo_lunarlander.py` is skipped if `Box2D` is unavailable). Longer-running jobs such as `sac_robotic_arm.py` live in the optional Infrastructure group—use `--skip-optional` for the fastest required pass.\n\nRun your first bandit:\n```bash\npython modules/module_01_intro/examples/bandit_epsilon_greedy.py --arms 10 --steps 2000 --epsilon 0.1\n```\n\nRun a value method (needs PyTorch):\n```bash\npython modules/module_02_value_methods/examples/dqn_cartpole.py --episodes 400 --learning-rate 1e-3\n```\n\nDiscover flags for any script:\n```bash\npython path/to/script.py --help\n```\n\n## 4. Prerequisites\n* Python 3.11 or 3.12 (3.13 is not yet supported by official PyTorch wheels—use Docker if you’re on a newer interpreter)\n* Basic linear algebra \u0026 probability familiarity\n* Optional GPU (CUDA or ROCm) for heavier experiments\n\n## 5. 
Module Path (Progressive Difficulty)\n| Module | Theme | Core Topics | Sample Command |\n|--------|-------|------------|----------------|\n| 01 Intro | Bandits | Epsilon-greedy, UCB, exploration vs exploitation | `bandit_epsilon_greedy.py --arms 10 --steps 2000` |\n| 02 Value Methods | Q / DQN / Rainbow | Replay, target nets, prioritized, distributional | `dqn_cartpole.py --episodes 400` |\n| 03 Policy Methods | REINFORCE / PG | Return estimation, baselines, continuous actions | `policy_gradient_pendulum.py --episodes 100` |\n| 04 Actor-Critic | **PPO** / **TD3** / A2C / SAC / TRPO | Industry-standard algorithms, trust regions | `ppo_cartpole.py --episodes 100` ⭐ |\n| 05 Advanced RL | Evolution, Curiosity, Multi-agent | Exploration bonuses, population search | `evolutionary_cartpole.py --generations 5` |\n| 06 Industry Cases | Applied RL | Energy, bidding, recommendation framing | `realtime_bidding_qlearning.py --episodes 500` |\n| 07 Operationalization | **Offline RL**, **RLHF**, Deployment | CQL, IQL, Dreamer, distributed training, TorchServe | `cql_offline_rl.py --mode compare` ⭐  |\n\nEach module folder includes `content.md` (theory + checklist) and an `examples/` directory of runnable scripts.\n\n---\n\n## 5.1. Algorithm Selection Guide - Which Algorithm Should I Use?\n\nUse this decision tree to quickly find the right algorithm for your problem:\n\n```\n┌─────────────────────────────────────────────────────┐\n│  What type of ACTION SPACE do you have?            │\n└──────────────────┬──────────────────────────────────┘\n                   │\n         ┌─────────┴─────────┐\n         │                   │\n    DISCRETE             CONTINUOUS\n    (e.g., 4 actions)    (e.g., torque, velocity)\n         │                   │\n         │                   │\n┌────────┴────────┐    ┌─────┴────────┐\n│                 │    │              │\n│  Do you have    │    │ Do you need  │\n│  offline data?  │    │ exploration? 
│\n│                 │    │              │\n└───┬─────────┬───┘    └───┬──────┬───┘\n    │         │            │      │\n   YES       NO           YES     NO\n    │         │            │      │\n    │         │            │      │\n    │    ┌────┴─────┐      │      │\n    │    │          │      │      │\n    │  On-Policy Off-Policy│      │\n    │    │          │      │      │\n    ▼    ▼          ▼      ▼      ▼\n  ┌─────────────────────────────────────────┐\n  │  RECOMMENDED ALGORITHMS                 │\n  ├─────────────────────────────────────────┤\n  │  Discrete + Offline    → CQL/IQL        │\n  │  Discrete + On-Policy  → PPO ⭐          │\n  │  Discrete + Off-Policy → DQN/Rainbow    │\n  │  Continuous + Explore  → SAC ⭐          │\n  │  Continuous + Exploit  → TD3 ⭐          │\n  │  Multi-Armed Bandits   → ε-Greedy       │\n  │  Model-Based          → DreamerV3       │\n  └─────────────────────────────────────────┘\n```\n\n### Quick Reference Table\n\n| Your Situation | Best Algorithm | File to Run | Why? 
|\n|----------------|----------------|-------------|------|\n| **Just starting RL** | PPO | `ppo_cartpole.py` | Most stable, widely used, fast training |\n| **Production deployment** | PPO or TD3 | `ppo_lunarlander.py` or `td3_pendulum.py` | Industry standards (ChatGPT uses PPO) |\n| **Continuous control (robotics)** | TD3 or SAC | `td3_pendulum.py` | State-of-the-art for continuous actions |\n| **Discrete actions (games)** | PPO or DQN | `ppo_cartpole.py` or `dqn_cartpole.py` | PPO for stability, DQN for sample efficiency |\n| **Limited data (offline RL)** | CQL or IQL | Module 07 examples | Learn from fixed datasets |\n| **Exploration needed** | SAC or Curiosity | `sac_robotic_arm.py` | Maximum entropy or intrinsic rewards |\n| **Sample efficiency critical** | Model-based (DreamerV3) | Future implementation | Learns world model, imagines trajectories |\n| **Learning the theory** | Start with Bandits → Q-Learning → Policy Gradient → PPO | Follow Module 01-04 | Progressive difficulty |\n\n### Algorithm Family Tree\n\n```\nReinforcement Learning Algorithms\n│\n├── Value-Based (Learn Q(s,a))\n│   ├── Tabular\n│   │   └── Q-Learning ..................... Module 02\n│   └── Deep\n│       ├── DQN ............................ Module 02 ⭐\n│       ├── Double DQN ..................... Module 02\n│       ├── Dueling DQN .................... Module 02\n│       └── Rainbow DQN .................... Module 02\n│\n├── Policy-Based (Learn π(a|s))\n│   ├── REINFORCE .......................... Module 03\n│   ├── Policy Gradient .................... Module 03\n│   └── Evolutionary Strategies ............ Module 05\n│\n├── Actor-Critic (Learn both π and V/Q)\n│   ├── On-Policy\n│   │   ├── A2C ............................ Module 04\n│   │   ├── PPO ............................ Module 04 ⭐⭐⭐ [INDUSTRY STANDARD]\n│   │   └── TRPO ........................... Module 04\n│   └── Off-Policy\n│       ├── DDPG ........................... 
(TD3 is better)\n│       ├── TD3 ............................ Module 04 ⭐⭐⭐ [INDUSTRY STANDARD]\n│       └── SAC ............................ Module 04 ⭐⭐\n│\n├── Model-Based (Learn environment model)\n│   ├── DreamerV3 .......................... Module 07 ⭐⭐ \n│   └── MuZero ............................. (Future)\n│\n├── Offline RL (Learn from fixed datasets)\n│   ├── CQL (Conservative Q-Learning) ...... Module 07 ⭐⭐⭐ \n│   ├── IQL (Implicit Q-Learning) .......... Module 07 ⭐⭐⭐ \n│   └── Behavioral Cloning ................. Module 07\n│\n└── Exploration \u0026 Advanced\n    ├── Multi-Armed Bandits ................ Module 01 ⭐ [START HERE]\n    ├── Curiosity-Driven ................... Module 05\n    ├── Multi-Agent ........................ Module 05\n    └── RLHF (for LLMs) .................... Module 07 ⭐⭐⭐ \n\n⭐ = Beginner-friendly\n⭐⭐ = Production-ready\n⭐⭐⭐ = Industry standard (2024-2025)\n```\n\n### When to Use Each Algorithm (Practical Decision Guide)\n\n**Use PPO when:**\n- ✅ You want the safest, most reliable choice\n- ✅ You're deploying to production (ChatGPT, AlphaStar use this)\n- ✅ You have either discrete OR continuous actions\n- ✅ You can afford to collect fresh data for each update\n- ✅ Training stability \u003e sample efficiency\n\n**Use TD3 when:**\n- ✅ You have continuous action spaces (robotics, control)\n- ✅ Sample efficiency matters (expensive simulations)\n- ✅ You can use a replay buffer (store past experiences)\n- ✅ You need deterministic policies\n- ✅ You're benchmarking against research papers\n\n**Use SAC when:**\n- ✅ You have continuous actions + need exploration\n- ✅ Maximum sample efficiency is critical\n- ✅ Environment is stochastic (benefits from entropy)\n- ✅ You want automatic temperature tuning\n- ✅ Robustness to hyperparameters is important\n\n**Use DQN when:**\n- ✅ You have discrete actions (simple games)\n- ✅ You want to learn from a replay buffer\n- ✅ You understand value-based methods\n- ✅ PPO is overkill for your simple 
problem\n\n**Use Multi-Armed Bandits when:**\n- ✅ You're just starting with RL (great introduction!)\n- ✅ You have a stateless decision problem\n- ✅ You need exploration strategies (ε-greedy, UCB)\n- ✅ A/B testing, recommendation systems\n\n**Use Model-Based (DreamerV3) when:**\n- ✅ Sample efficiency is CRITICAL (very expensive data)\n- ✅ You can learn an accurate world model\n- ✅ You want to plan ahead via imagination\n- ✅ You have access to GPU resources\n\n---\n\n## 5.2. Visual Diagrams\n\n📊 **Comprehensive visual guides available in [`docs/diagrams/`](docs/diagrams/)**\n\nWe've created professional diagrams to help you understand the project structure and choose the right algorithms:\n\n### 🎨 Available Visualizations\n\n1. **[Algorithm Taxonomy](docs/diagrams/algorithm_taxonomy.png)** - Complete overview of all 7 modules with algorithms color-coded by category\n2. **[Learning Path](docs/diagrams/learning_path.png)** - Step-by-step progression from beginner to production-ready engineer\n3. **[Algorithm Selection Tree](docs/diagrams/algorithm_selection_tree.png)** - Decision tree to help you choose the right algorithm\n4. **[Project Features Comparison](docs/diagrams/project_features_comparison.png)** - How RL 101 compares to typical tutorials\n5. **[Algorithm Network](docs/diagrams/algorithm_network.png)** - Visual representation of algorithm relationships and dependencies\n\n**Generate diagrams yourself:**\n```bash\npython scripts/generate_project_diagrams.py\n```\n\nSee [`docs/diagrams/README.md`](docs/diagrams/README.md) for detailed descriptions and usage guidelines.\n\n---\n\n## 6. 
Project Layout\n```\nmodules/                # Module theory + examples (core learning path)\ndocker/                 # Dockerfiles, docker-compose.yml, run scripts\nscripts/                # Smoke tests \u0026 utilities\nrequirements/           # Modular requirements (base, CPU, CUDA, ROCm)\nsetup.sh               # Smart setup script with GPU auto-detection\nSETUP.md               # Comprehensive setup guide\n```\n\n## 7. Running Examples (More Highlights)\nBandits / Intro (NumPy only):\n```bash\npython modules/module_01_intro/examples/bandit_epsilon_greedy.py --arms 5 --steps 1000 --epsilon 0.05\npython modules/module_01_intro/examples/bandit_ucb.py --arms 10 --steps 2000 --c 2.0  # UCB exploration\n```\n\nValue Methods:\n```bash\npython modules/module_02_value_methods/examples/q_learning_cartpole.py --episodes 300\npython modules/module_02_value_methods/examples/rainbow_atari.py --episodes 100  # Needs Atari env \u0026 ROM legality\n```\n\nPolicy \u0026 Actor-Critic:\n```bash\npython modules/module_03_policy_methods/examples/reinforce_cartpole.py --episodes 300\npython modules/module_04_actor_critic/examples/sac_robotic_arm.py --episodes 200\n```\n\nAdvanced / Exploration:\n```bash\npython modules/module_05_advanced_rl/examples/curiosity_supermario.py --episodes 50  # External assets may be required\npython modules/module_05_advanced_rl/examples/multiagent_gridworld.py --episodes 200\n```\n\nIndustry \u0026 Ops:\n```bash\npython modules/module_06_industry_cases/examples/energy_optimization_dqn.py --episodes 300\npython modules/module_07_operationalization/examples/torchserve_inference.py --model-path ./models/\n```\n\n**⭐ NEW: Advanced Algorithms (2024-2025)**\n```bash\n# Offline RL - Learn from fixed datasets (no environment interaction!)\npython modules/module_07_operationalization/examples/cql_offline_rl.py --mode compare --dataset-path data/cartpole.pkl\npython modules/module_07_operationalization/examples/iql_offline_rl.py --mode compare --dataset-path 
data/cartpole.pkl\n\n# Model-Based RL - Train policy in imagination\npython modules/module_07_operationalization/examples/dreamer_model_based.py --env CartPole-v1 --episodes 200\n\n# RLHF - Language model alignment (like ChatGPT)\npython modules/module_07_operationalization/examples/rlhf_text_generation.py --task sentiment --iterations 100\n\n# Infrastructure - Distributed training \u0026 hyperparameter tuning\npython modules/module_07_operationalization/examples/ray_distributed_ppo.py --num-workers 4\npython modules/module_07_operationalization/examples/hyperparameter_tuning_optuna.py --n-trials 50\n\n# Benchmark Suite - Compare algorithms\npython modules/module_07_operationalization/examples/benchmark_suite.py --env CartPole-v1 --algorithms dqn ppo\n```\n\nUse smaller numbers (`--episodes 5`, `--generations 1`, tiny populations) for dry runs.\n\n## 8. Environment \u0026 Reproducibility\nSet seeds (many scripts expose `--seed`):\n```bash\npython modules/module_02_value_methods/examples/dqn_cartpole.py --episodes 50 --seed 42\n```\nDesign tenets (consolidated below in [Best Practices](#9-best-practices)):\n* Deterministic where feasible (seeding PyTorch, NumPy, env wrappers)\n* Structured logging with `rich` for human scan + copy/paste\n* Explicit CLI flags over hidden config files\n* Separate environment creation from learning logic\n* Incremental complexity; minimal runnable baseline first\n\n## 9. 
Best Practices\n\nGuidelines distilled from maintaining RL101 in production-like environments.\n\n### Development Environment\n\n- **Use `setup.sh`** for automated provisioning with GPU auto-detection.\n- **Prefer `.venv`** when working natively to isolate dependencies.\n- **Containerize for consistency** using the curated Dockerfiles when collaborating or deploying.\n- **Install modular requirements**: start with `requirements-base.txt`, then add the PyTorch variant that matches your hardware.\n\n```bash\n# Base dependencies (always needed)\npip install -r requirements/requirements-base.txt\n\n# Choose a PyTorch flavor\npip install -r requirements/requirements-torch-cpu.txt   # CPU-only\npip install -r requirements/requirements-torch-cuda.txt  # NVIDIA GPU\npip install -r requirements/requirements-torch-rocm.txt  # AMD GPU\n```\n\n### Code Quality\n\n**Reproducibility**\n\n- Seed NumPy, PyTorch, and Gymnasium environments whenever deterministic comparisons matter.\n- Expose a `--seed` flag on new scripts; document stochastic behavior when determinism is infeasible.\n\n**Logging \u0026 Output**\n\n- Use `rich` for structured console output instead of plain `print`.\n- Capture core metrics (reward, loss, epsilon/temperature, entropy) and surface them per episode.\n- Employ `rich.progress` for long-running loops to track momentum without spamming logs.\n\n**Configuration Management**\n\n- Stick to a CLI-first design with `argparse`.\n- Favor explicit flags over hidden configuration files; ship scripts with sane defaults.\n- Craft comprehensive `--help` text so users can discover knobs quickly.\n\n**Code Organization**\n\n- Separate environment setup, learning logic, and evaluation/serving into distinct functions or modules.\n- Keep functions focused; add type hints where it aids readability.\n- Include top-of-file docstrings that state the goal and show a sample command.\n\n**Dependency Management**\n\n- Guard optional imports and fail gracefully with actionable 
guidance:\n\n```python\ntry:\n    import torch\nexcept ImportError:\n    console.print(\"[red]PyTorch required. Install: pip install torch[/red]\")\n    sys.exit(1)\n```\n\n**Incremental Complexity**\n\n- Start from a minimal working agent before layering advanced tricks.\n- Add enhancements one at a time, validating behavior (and performance) at each step.\n- Capture the “why” in comments—especially when you diverge from textbook algorithms.\n\n### Testing \u0026 Validation\n\n- Ship defaults that finish within minutes so contributors can iterate quickly.\n- Support tiny dry-run parameters (e.g., `--episodes 5`, `--generations 1`).\n- Run smoke tests before commits touching shared code:\n\n```bash\n# Run all tests (comprehensive)\npython scripts/smoke_test.py\n\n# Quick validation (core only, ~30 seconds)\npython scripts/smoke_test.py --core-only\n\n# Skip optional tests (infrastructure, advanced)\npython scripts/smoke_test.py --skip-optional\n\n# Test specific groups\npython scripts/smoke_test.py --group core\npython scripts/smoke_test.py --group deep-rl\npython scripts/smoke_test.py --group infrastructure\npython scripts/smoke_test.py --group advanced\n\n# Verbose output for debugging\npython scripts/smoke_test.py --verbose\n```\n\n- When possible, verify changes across CPU, CUDA, and ROCm configurations—Docker images help here.\n\n### Performance Tips\n\n**Native Environment**\n\n- Match your dependency footprint to your hardware; CPU-only wheels keep things lean.\n- Isolate work in a virtual environment to dodge global site-packages conflicts.\n\n**Docker Environment**\n\n- Optimize Dockerfiles for layer caching (requirements before code) to speed rebuilds.\n- Mount a pip cache volume when rebuilding frequently.\n- Stick with minimal base images unless you absolutely need a heavier stack.\n\n**Runtime Optimization**\n\n- Profile before prematurely optimizing; use PyTorch/TensorBoard profilers to find hot spots.\n- Prefer vectorized NumPy/PyTorch ops over 
Python loops.\n- Manage GPU memory proactively (`torch.cuda.empty_cache()`) when experimenting with large models.\n\n## 10. GPU \u0026 Docker\n\n### Docker Setup\nPrefer Docker for reproducible environments and hassle-free GPU support:\n\n```bash\n# Automated (auto-detects GPU)\n./setup.sh docker\n\n# Manual selection\nbash docker/run.sh cpu     # CPU-only (lightweight ~500MB)\nbash docker/run.sh cuda    # NVIDIA CUDA 12.9 + PyTorch 2.8\nbash docker/run.sh rocm    # AMD ROCm 6.x + PyTorch\n```\n\n### Native GPU Setup\nFor native installations with GPU support:\n\n**NVIDIA CUDA:**\n```bash\npip install -r requirements/requirements-base.txt\npip install -r requirements/requirements-torch-cuda.txt\n```\n\n**AMD ROCm:**\n```bash\npip install -r requirements/requirements-base.txt\npip install -r requirements/requirements-torch-rocm.txt\n```\n\n**Verify GPU:**\n```bash\npython -c \"import torch; print(f'CUDA: {torch.cuda.is_available()}')\"\n```\n\nInside Docker containers, the repo is mounted at `/workspace`. Run scripts directly without additional setup.\n\n## 11. 
Dependencies\n\n### Requirements Structure\n```\nrequirements/\n├── requirements-base.txt        # Core: NumPy, Rich, Gymnasium, TensorBoard\n├── requirements-torch-cpu.txt   # PyTorch CPU-only (~500MB)\n├── requirements-torch-cuda.txt  # PyTorch with CUDA support\n└── requirements-torch-rocm.txt  # PyTorch with ROCm (AMD GPU)\n```\n\n**Core (always installed):** `numpy\u003e=1.24`, `rich\u003e=13.7`, `gymnasium[classic-control]\u003e=1.0`, `tensorboard\u003e=2.16`\n\n**PyTorch (choose one):**\n- CPU-only: Lightweight, works everywhere\n- CUDA: NVIDIA GPUs (requires CUDA 12.x drivers)\n- ROCm: AMD GPUs (requires ROCm 6.x drivers)\n\n### Python Version Policy\n* **Recommended:** Python 3.11 (best PyTorch wheel availability)\n* **Supported:** 3.11 and 3.12 (PyTorch wheels available and tested)\n* **Minimum:** 3.11 (earlier versions not supported)\n* **Note:** Python 3.13+ is currently unsupported by PyTorch wheels—use Docker or downgrade your interpreter\n\nIf a script requires PyTorch and it's missing, it exits with clear guidance.\n\n## 12. 
Testing \u0026 Fast Validation\n\n### Comprehensive Smoke Tests\nThe project includes an intelligent test suite organized by dependency requirements:\n\n```bash\n# Run all tests (4 groups: core, deep-rl, infrastructure, advanced)\npython scripts/smoke_test.py\n\n# Quick test - core examples only (no PyTorch, ~30 seconds)\npython scripts/smoke_test.py --core-only\n\n# Skip optional tests (faster CI/CD)\npython scripts/smoke_test.py --skip-optional\n\n# Test specific groups\npython scripts/smoke_test.py --group core           # NumPy-based examples\npython scripts/smoke_test.py --group deep-rl        # PyTorch-based RL\npython scripts/smoke_test.py --group infrastructure # GPU, distributed, tracking\npython scripts/smoke_test.py --group advanced       # Offline RL, RLHF, Dreamer\n\n# Verbose output for debugging\npython scripts/smoke_test.py --verbose\n```\n\n**Test Groups:**\n- **Core Examples**: Multi-armed bandits, tabular RL (no PyTorch required)\n- **Deep RL Examples**: DQN, PPO, TD3, SAC, TRPO (requires PyTorch)\n- **Infrastructure**: GPU optimization, Ray RLlib, Optuna (optional dependencies; includes the slower `sac_robotic_arm.py` quick-run variant)\n- **Advanced Algorithms**: CQL, IQL, Dreamer, RLHF (cutting-edge research)\n\nWhen you launch the suite it announces which optional dependencies are available (PyTorch, Ray, Optuna) and marks missing ones clearly. Tests that depend on unavailable extras are counted as skipped rather than failed, so you can still get a green run on slim environments. 
Box2D-driven workloads (e.g., `ppo_lunarlander.py`) auto-skip when `Box2D` isn’t installed, keeping core validation snappy.\n\n### Quick Algorithm Validation\n```bash\n# Test Phase 1: Core algorithms (2 minutes each)\npython modules/module_04_actor_critic/examples/ppo_cartpole.py --episodes 5\npython modules/module_04_actor_critic/examples/td3_pendulum.py --episodes 5\n\n# Test Phase 3: Advanced algorithms (1 minute each)\npython modules/module_07_operationalization/examples/cql_offline_rl.py --mode generate --dataset-size 1000\npython modules/module_07_operationalization/examples/benchmark_suite.py --trials 1 --episodes 2\n```\n\n## 13. Troubleshooting\n| Symptom | Likely Cause | Fix |\n|---------|--------------|-----|\n| ImportError: torch | Not installed / wrong Python version | `pip install torch` or use Docker |\n| Extremely slow training | Running on CPU with large model | Reduce network size / episodes; try GPU container |\n| Atari env fails | ROM / ALE dependency missing | Install appropriate gymnasium extras; ensure legal ROM acquisition |\n| Non-deterministic returns | Env stochasticity | Set `--seed`, limit parallelism, check gymnasium version |\n\n## 14. Extending the Project\nAdd a new example script under the appropriate module's `examples/` and follow existing patterns:\n* Top docstring: purpose + minimal usage\n* `argparse` flags with sane defaults\n* Seed handling (`--seed`)\n* Clear separation: model definition, experience gathering, update step\n* Log episodic reward + key diagnostics (loss, epsilon, entropy, etc.)\n\n## 15. Contributing\nSee `CONTRIBUTING.md`.\nPrinciples:\n* Keep runtimes short by default (fast smoke params)\n* Avoid heavy hidden dependencies; guard imports\n* Favor clarity over cleverness—this is a teaching repo\n* Log roadmap-impacting ideas via GitHub Issues or Discussions so the community can weigh in\n\n## 16. 
Roadmap\n\n### Completed ✅\n\n#### Core Algorithms\n- ✅ Expand DQN to Double/Dueling/Prioritized Replay\n- ✅ Add Rainbow Atari example\n- ✅ Add policy gradient examples (REINFORCE, Pendulum)\n- ✅ Add A2C and SAC examples\n- ✅ **Add industry-standard algorithms (PPO, TD3, TRPO)** ⭐ **Phase 1 Complete**\n- ✅ Add advanced topics (curiosity, multi-agent)\n- ✅ **Add cutting-edge algorithms (CQL, IQL, Dreamer, RLHF)** ⭐ **Phase 3 Complete**\n\n#### Infrastructure \u0026 Setup\n- ✅ Modular requirements structure (base, CPU, CUDA, ROCm)\n- ✅ Automated setup script with GPU auto-detection (`setup.sh`)\n- ✅ Optimized Dockerfiles (CPU: python:3.11-slim, CUDA/ROCm: official bases)\n- ✅ Enhanced docker-compose.yml with pip caching and proper GPU configs\n- ✅ Comprehensive setup documentation (`SETUP.md`)\n- ✅ Updated `CONTRIBUTING.md` with development guidelines\n- ✅ **Intelligent smoke test suite with dependency-based grouping** ⭐ **Phase 2 Complete**\n- ✅ **GPU optimization (vectorized envs, mixed precision)** ⭐ **Phase 2 Complete**\n- ✅ **Distributed training (Ray RLlib)** ⭐ **Phase 2 Complete**\n- ✅ **Hyperparameter tuning (Optuna)** ⭐ **Phase 2 Complete**\n- ✅ **TensorBoard integration** ⭐ **Phase 2 Complete**\n\n### In Progress 🔨\n- Docker multi-platform builds (ARM64 support)\n- Automated Docker image publishing to registry\n- Pre-commit hooks for code quality\n\n### Future Enhancements 🚀\n\n#### Algorithms\n- ~~Add more advanced algorithms (PPO, TRPO, TD3)~~ ✅ **DONE (Phase 1)**\n- ~~Model-based RL examples (Dreamer)~~ ✅ **DONE (Phase 3)**\n- ~~Offline RL (CQL, IQL)~~ ✅ **DONE (Phase 3)**\n- ~~RLHF for LLMs~~ ✅ **DONE (Phase 3)**\n- MuZero implementation\n- Meta-RL and few-shot learning examples\n- Hierarchical RL implementations\n\n#### Multi-Agent \u0026 Competition\n- Expand multi-agent scenarios (competitive environments)\n- Self-play implementations\n- Communication protocols in multi-agent systems\n- Tournament and leaderboard systems\n\n#### Industry 
Applications\n- Add more industry case studies (finance, healthcare, robotics)\n- Real-world deployment patterns\n- A/B testing frameworks for RL policies\n- Cost optimization and budget constraints\n\n#### Tools \u0026 Infrastructure\n- Enhanced visualization and debugging tools\n- ~~Integration with popular RL frameworks (Ray RLlib)~~ ✅ **DONE (Phase 2)**\n- ~~Performance benchmarking suite~~ ✅ **DONE (Phase 3)**\n- ~~Hyperparameter optimization examples (Optuna, Ray Tune)~~ ✅ **DONE (Phase 2)**\n- ~~Distributed training examples~~ ✅ **DONE (Phase 2)**\n- Cloud deployment guides (AWS, GCP, Azure)\n- Kubernetes production deployment examples\n\n#### Documentation \u0026 Learning\n- Interactive tutorials and exercises\n- Video walkthroughs for each module\n- Jupyter notebook variants (optional, for exploratory learning)\n- Algorithm comparison benchmarks\n- Common pitfalls and debugging guide\n\n## 17. References\n\n- Sutton \u0026 Barto: *Reinforcement Learning: An Introduction*\n- OpenAI: *Spinning Up in Deep RL*\n- Gymnasium documentation\n- Stable-Baselines3 documentation\n- CleanRL\n\n## 18. License\nLicensed under the **Apache License, Version 2.0** (see `LICENSE`). You may use, modify, and distribute this project under the terms of that license. Please retain instructional comments where practical to preserve educational value.\n\n## 19. Citation (Optional)\nIf this helped your study or project, consider citing:\n```\n@misc{rl101tutorial,\n\ttitle  = {Reinforcement Learning 101: A Progressive Hands-On Project},\n\tauthor = {Stephen Shao},\n\tyear   = {2025},\n\thowpublished = {GitHub repository},\n\turl    = {https://github.com/AIComputing101/reinforcement-learning-101}\n}\n```\n\n## 20. FAQ\nQ: Why no notebooks?  \nA: Scripts enforce explicit structure, are easier to diff, and keep parity with production code. You can still adapt them into notebooks if desired.\n\nQ: Where are pretrained weights?  
\nA: Intentionally omitted to nudge you to train your own; cache trained weights yourself if you extend the project.\n\nQ: How long do examples take?  \nA: Baselines aim for a few minutes on CPU; scale episodes upward only after verifying the end-to-end flow.\n\n---\nHappy learning \u0026 experimenting. PRs welcome!","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Faicomputing101%2Freinforcement-learning-101","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Faicomputing101%2Freinforcement-learning-101","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Faicomputing101%2Freinforcement-learning-101/lists"}