{"id":48578204,"url":"https://github.com/scthornton/infiniband","last_synced_at":"2026-04-08T16:04:14.778Z","repository":{"id":300665583,"uuid":"1006764622","full_name":"scthornton/infiniband","owner":"scthornton","description":"Infiniband notes and study guide","archived":false,"fork":false,"pushed_at":"2026-03-25T14:42:38.000Z","size":19,"stargazers_count":5,"open_issues_count":0,"forks_count":1,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-03-26T17:20:33.886Z","etag":null,"topics":["infiniband","nvidia","routing","routing-engine","study-notes"],"latest_commit_sha":null,"homepage":"","language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/scthornton.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":"SECURITY.md","support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-06-23T00:30:04.000Z","updated_at":"2026-03-25T14:43:17.000Z","dependencies_parsed_at":"2025-06-23T01:39:10.642Z","dependency_job_id":null,"html_url":"https://github.com/scthornton/infiniband","commit_stats":null,"previous_names":["scthornton/infiniband"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/scthornton/infiniband","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/scthornton%2Finfiniband","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/scthornton%2Finfiniband/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/scthornton%2Finfiniband/releases","manifests_url":"https://repos.ec
osyste.ms/api/v1/hosts/GitHub/repositories/scthornton%2Finfiniband/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/scthornton","download_url":"https://codeload.github.com/scthornton/infiniband/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/scthornton%2Finfiniband/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31562700,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-08T14:31:17.711Z","status":"ssl_error","status_checked_at":"2026-04-08T14:31:17.202Z","response_time":54,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["infiniband","nvidia","routing","routing-engine","study-notes"],"created_at":"2026-04-08T16:04:14.687Z","updated_at":"2026-04-08T16:04:14.772Z","avatar_url":"https://github.com/scthornton.png","language":null,"funding_links":[],"categories":[],"sub_categories":[],"readme":"# InfiniBand Study Guide and Notes\n\n## Table of Contents\n1. [What is InfiniBand?](#what-is-infiniband)\n2. [InfiniBand Architecture](#infiniband-architecture)\n3. [Key Features](#key-features)\n4. [Network Components](#network-components)\n5. [Bandwidth Evolution](#bandwidth-evolution)\n6. [NVIDIA Products](#nvidia-products)\n7. [Customer Applications](#customer-applications)\n8. 
[Routing Engines Deep Dive](#routing-engines-deep-dive)\n9. [Troubleshooting Common Issues](#troubleshooting-common-issues)\n10. [Essential Commands and Tools](#essential-commands-and-tools)\n11. [Study Tips \u0026 Practice Questions](#study-tips--practice-questions)\n\n---\n\n## What is InfiniBand?\n\n### Core Definition\nInfiniBand is an industry-standard specification that defines a high-performance input/output architecture for interconnecting compute nodes, storage systems, and networking equipment.\n\n### Comparison with Traditional Networking\n- **Traditional TCP/IP**: Software-based transport with higher CPU overhead and potential packet loss\n- **InfiniBand**: Hardware-based transport with dedicated pathways, no packet loss, and intelligent traffic management\n\n### Key Characteristics\n- **OS Independent**: Works seamlessly with Linux, Windows, and ESXi\n- **Application-Centered**: Applications communicate directly without heavy OS involvement\n- **Hardware-Based**: Transport protocols are implemented in hardware for maximum performance\n\n---\n\n## InfiniBand Architecture\n\n### The Five-Layer Model\nInfiniBand's architecture consists of five distinct layers, each with specific responsibilities:\n\n#### 🏢 Upper Layer (Applications)\n**Purpose**: Interface between applications and InfiniBand services\n\n**Key Protocols**:\n- **MPI (Message Passing Interface)** - Standard for parallel computing in HPC\n- **NCCL (NVIDIA Collective Communication Library)** - For AI/ML workloads\n- **RDMA Storage Protocols** - Including SRP, iSER, NFS-RDMA\n- **IP over InfiniBand (IPoIB)** - Runs standard IP applications over InfiniBand\n- **SDP (Sockets Direct Protocol)** - Provides socket-like interface with RDMA benefits\n\n#### 🚛 Transport Layer\n**Purpose**: Hardware-based transport services\n\n**Key Feature**: CPU offloading through **RDMA (Remote Direct Memory Access)**\n\n**Three Key Technologies**:\n1. 
**Hardware-based transport**: Network card handles protocol processing\n2. **Kernel bypass**: Data doesn't go through operating system\n3. **RDMA**: Direct memory-to-memory transfer between computers\n\n**Advantage**: Messages go directly from one application's memory to another's, completely bypassing the CPU\n\n#### 🗺️ Network Layer\n**Purpose**: Routing between different InfiniBand subnets\n\n**Uses**: Global IDs (GIDs) for addressing\n\n**Key Header**: **Global Routing Header (GRH)** - used when packets need to cross subnet boundaries\n\n**Function**: Connects multiple InfiniBand subnets via routers\n\n#### 🚦 Link Layer\n**Purpose**: Packet switching within a subnet using Local IDs (LIDs)\n\n**Key Features**:\n- **LID-based Switching**: Packets forwarded based on destination LID\n- **Credit-based Flow Control**: Prevents packet loss by tracking receiver buffer space\n- **Lossless Networking**: Packets are never dropped during normal operation\n- **Switch Forwarding**: Tables map destination LIDs to exit ports\n\n**Critical Concept**: This layer ensures lossless packet delivery\n\n#### ⚡ Physical Layer\n**Purpose**: Signal transmission over physical medium\n\n**Key Responsibilities**:\n- **Signal Integrity**: Ensures reliable electrical/optical signal transmission\n- **Bit Encoding**: How bits are placed on the wire\n- **Cable Specifications**: Defines characteristics for copper and optical cables\n- **Valid Packet Detection**: Signaling protocol to identify valid packets\n\n**Lane Aggregation**: Multiple physical lanes combined for higher bandwidth\n- **Formula**: Total bandwidth = single lane speed × number of lanes\n\n### Communication Flow Process\nData transmission through the layers:\n1. **Upper Layer**: Application initiates data transfer\n2. **Transport Layer**: Automated protocol handling and RDMA processing\n3. **Network Layer**: Route determination for inter-subnet communication\n4. **Link Layer**: Local subnet traffic management and switching\n5. 
**Physical Layer**: Physical signal transmission\n\n---\n\n## Key Features\n\n### 🔧 Simplified Management\n**Subnet Manager (SM)**: Centralized network management\n\n**Benefits**: Plug-and-play functionality, automatic routing\n\n**Key Functions**:\n- **LID Assignment**: Assigns Local IDs to ALL nodes (servers and switches)\n- **Plug-and-Play**: Enables automatic node discovery and configuration\n- **Routing Tables**: Calculates and programs forwarding tables in switches\n- **Topology Discovery**: Maps the entire subnet topology\n- **Redundancy**: Master and standby SM for reliability\n- **Scope**: Manages up to 48,000 nodes in a single subnet\n\n### 🚀 High Bandwidth\n**Evolution Timeline**:\n- 2002: 10 Gb/s (Single Data Rate)\n- 2008: 40 Gb/s (Quadruple Data Rate)\n- 2011: 56 Gb/s (Fourteen Data Rate)\n- 2015: 100 Gb/s (Enhanced Data Rate)\n- 2018: 200 Gb/s (High Data Rate)\n- 2021: 400 Gb/s (Next Data Rate)\n\n**Formula**: Link rate = single lane speed × number of lanes\n\n### 💨 Ultra-Low Latency\n**Performance**: As low as 1 microsecond (1,000 nanoseconds)\n\n**Achievement**: Hardware offloading eliminates software processing delays\n\n### 🔄 CPU Offloads\nThe RDMA mechanism spans both **Transport \u0026 Upper layers** - implemented through BTH (Base Transport Header) and ETH (Extended Transport Header).\n\n### 📈 Easy Network Scale-Out\n- **Single subnet**: Up to 48,000 nodes\n- **Multiple subnets**: Unlimited scaling with routers\n\n### 🎯 Quality of Service (QoS)\n**Function**: Prioritizes different types of traffic\n\n**Implementation**: Different applications mapped to different priority queues\n\n**Special VL**: **VL15 is exclusively reserved for subnet management traffic**\n\n### 🛡️ Fabric Resiliency\n- **Standard Recovery**: ~5 seconds (Subnet Manager recalculation)\n- **NVIDIA Self-Healing**: ~1 millisecond (5000x faster!)\n\n### ⚖️ Adaptive Routing\n**Function**: Dynamically balances traffic across available paths\n\n**Manager**: Queue Manager monitors 
port utilization\n\n### 🔥 SHARP (Scalable Hierarchical Aggregation and Reduction Protocol)\n**Purpose**: Performs collective operations in network switches instead of endpoints\n\n**Benefit**: Up to 10x performance improvement for MPI applications\n\n---\n\n## Network Components\n\n### 🔌 Host Channel Adapters (HCAs)\n**Function**: Connect hosts to InfiniBand network\n\n**Example**: ConnectX-6 VPI cards (up to 200 Gb/s)\n\n**VPI Note**: VPI (Virtual Protocol Interconnect) cards can operate in either InfiniBand mode or Ethernet mode\n\n### 🔀 InfiniBand Switches\n- **Edge Switches**: QM8700 series (40 ports, 200 Gb/s each)\n- **Director Switches**: CS8500 series (800 ports, complete Fat-Tree in chassis)\n\n### 🌉 Gateways and Routers\n- **Gateway**: Connects InfiniBand to Ethernet networks\n- **Router**: Connects multiple InfiniBand subnets\n- **Example**: Mellanox Skyway gateway\n\n**Key Distinction**: \n- Routers connect InfiniBand subnets (uses GIDs)\n- Gateways connect InfiniBand to Ethernet\n\n### 🔗 Cables\n- **DAC (Direct Attach Copper)**: Short distances, cost-effective\n- **AOC (Active Optical Cable)**: Longer distances, higher performance\n\n---\n\n## Topologies\n\nInfiniBand supports multiple network topologies:\n\n### 🌳 Fat Tree (Most Common)\n- **Structure**: Hierarchical tree with increasing bandwidth toward the top\n- **Benefit**: Non-blocking bandwidth between any two nodes\n\n### 🍩 Torus\n- **Structure**: Nodes arranged in a grid with wraparound connections\n- **Benefit**: Multiple paths between nodes\n\n### 🐉 Dragonfly+\n- **Structure**: Hierarchical with local and global groups\n- **Benefit**: Efficient for large-scale deployments\n\n---\n\n## Routing Engines Deep Dive\n\n### Types of Routing Engines\n\n#### **Min-hop Routing**\n- **Default algorithm** - invoked automatically if no routing algorithm is specified\n- Finds the shortest path between nodes\n- Simple and efficient for basic topologies\n\n#### **Up-Down Routing**\n- **Restriction**: Does 
not allow paths that go \"down\" (away from the roots) and then \"up\" (toward the roots)\n- **Requirement**: Needs clearly defined root nodes to establish hierarchy\n- **Fallback**: If no root nodes are found, OpenSM falls back to Min-Hop routing\n\n#### **Fat-tree Routing**\n- **Best for**: Symmetrical or almost symmetrical fat-tree topologies\n- **Self-healing capability**: Can reroute around multiple cable failures and complete switch failures\n- **Structure**: Hierarchical tree with increasing bandwidth toward the top\n- **Benefit**: Non-blocking bandwidth between any two nodes\n\n#### **Adaptive Routing**\n- **Advanced feature**: Dynamically balances traffic and reroutes around failures\n- **Real-time optimization**: Monitors congestion and automatically adjusts paths\n\n#### **Torus-2QoS**\n- **Specialized**: Uses non-minimal multi-path routing (min-hop+1, min-hop+3) to achieve maximum throughput\n- **Purpose**: Optimized for torus topologies with load balancing\n\n#### **Dragonfly+**\n- **Structure**: Hierarchical with local and global groups\n- **Benefit**: Efficient for large-scale deployments\n\n---\n\n## Troubleshooting Common Issues\n\n### Heavy Sweep vs Light Sweep\n\n#### **Heavy Sweeps** are triggered by:\n- Server reboots (topology changes)\n- Switch resets (fabric reconfiguration needed)\n- Any event requiring complete topology rediscovery and LID reassignment\n\n#### **Light Sweeps** are for:\n- **Periodic monitoring** - runs at regular intervals\n- **Port status tracking** - monitors port state changes\n- Performance monitoring (like high transmit packet wait timer values)\n\n### Common Port States and Their Meanings\n\n#### **State: Initializing + Physical State: LinkUp**\n**Most likely cause**: No running Subnet Manager in the fabric\n- Physical link is good (LinkUp)\n- But port can't complete initialization (Base LID: 0, SM LID: 0)\n- Port needs SM to assign LID and transition to Active state\n\n#### **Link Layer: Ethernet (instead of 
InfiniBand)**\n**Cause**: HCA port is not part of the InfiniBand fabric\n- VPI cards can operate in either mode\n- Port is working fine but in wrong protocol mode\n- Needs reconfiguration from Ethernet to InfiniBand mode\n\n### Network Connectivity Issues\n\n#### **Credit-based Flow Control**\n**Goal**: Assure that packets are not lost within the fabric even in the presence of congestion\n- Tracks receiver buffer space\n- Prevents packet overflow\n- Maintains lossless operation\n- This is what makes InfiniBand \"lossless\"\n\n---\n\n## Essential Commands and Tools\n\n### **Hardware Discovery**\n```bash\n# Check for NVIDIA/Mellanox HCAs\nlspci | grep Mellanox\n\n# Get detailed HCA information including PSID\nibv_devinfo\n\n# Check port status\nibstat\n```\n\n### **Network Diagnostics**\n```bash\n# Discover fabric topology\nibnetdiscover\n\n# Test connectivity between nodes\nibping \u003cdestination\u003e\n\n# Trace packet path through fabric\nibtracert \u003cdestination\u003e\n\n# Check local port status\nibstatus\n```\n\n### **Firmware Management**\n```bash\n# Query current firmware (use PSID to identify suitable firmware)\nibv_devinfo  # Gets PSID for firmware selection\n\n# Firmware upgrade tool\nflint -d \u003cdevice\u003e -i \u003cfirmware_file\u003e burn\n```\n\n### **OpenSM Management**\n- **OpenSM is embedded in DOCA-OFED** - no separate installation needed\n- Check OpenSM logs for routing issues:\n```bash\ncat /var/log/opensm.log | grep updn\n```\n\n### **Driver Information**\n```bash\n# Check DOCA-OFED version\nofed_info\n\n# Check MST (Mellanox Software Tools) status\nmst status\n```\n\n---\n\n## Customer Applications\n\n### 🏢 Market Segments\n1. **High Performance Computing (HPC)**: Weather modeling, scientific simulations\n2. **Artificial Intelligence**: Deep learning training and inference\n3. **Data Science**: Big data analytics and processing\n4. **Cloud Computing**: Hyperscale data centers\n5. 
**Machine Learning**: Model training and deployment\n\n### 📊 Real-World Examples\n- **Supercomputers**: Systems with 2K-8K nodes delivering 9-35 PFlops performance\n- **Cloud Providers**: Microsoft Azure HPC/AI cloud infrastructure\n- **Research Institutions**: Universities and national labs worldwide\n\n---\n\n## Study Tips \u0026 Practice Questions\n\n### 🎯 Memory Techniques\n\n**Architecture Layers (Bottom to Top)**: \"Please Link Networks To Users\"\n- **P**hysical\n- **L**ink\n- **N**etwork\n- **T**ransport\n- **U**pper\n\n### Key Numbers to Remember\n- **Latency**: 1 microsecond (1,000 nanoseconds)\n- **Bandwidth**: 400 Gb/s (current top speed)\n- **Subnet size**: 48,000 nodes max (NOT 4,000!)\n- **Self-healing**: 5000x faster than traditional (1ms vs 5s)\n- **Maximum packet payload**: 4096 bytes (4KB)\n- **Management VL**: VL15 exclusively for subnet management\n\n### 🚨 Critical Distinctions to Master\n\n**Component Roles** (Frequently Confused):\n- **Subnet Manager**: Assigns LIDs, manages single subnet, enables plug-and-play\n- **Adaptive Routing**: Dynamic load balancing and congestion avoidance\n- **InfiniBand Router**: Connects different InfiniBand subnets (uses GIDs)\n- **InfiniBand Gateway**: Connects InfiniBand to Ethernet networks\n- **SHARP**: Optimizes collective operations (MPI performance)\n- **Self-Healing**: Fast fault recovery (1ms vs 5s traditional)\n\n**Layer Responsibilities**:\n- **Physical Layer**: Signal integrity, cable specifications, bit transmission\n- **Link Layer**: Flow control, LID-based switching, lossless networking\n- **Network Layer**: Inter-subnet routing with GIDs, uses Global Routing Header (GRH)\n- **Transport Layer**: Hardware-based protocols, RDMA, CPU offloading\n- **Upper Layer**: Application interfaces (MPI, NCCL, SDP, IPoIB, etc.)\n\n### 📝 Practice Questions with Detailed Explanations\n\n#### **Question 1: Subnet Management**\n*Which component assigns LIDs to nodes in the network?*\n\n**Answer**: Subnet 
Manager\n\n**Explanation**: The SM is responsible for LID assignment to all nodes (both servers and switches), enabling the plug-and-play functionality that makes InfiniBand easy to manage.\n\n#### **Question 2: Inter-subnet Communication**\n*How can traffic be routed between two different InfiniBand subnets?*\n\n**Answer**: By using an InfiniBand router\n\n**Explanation**: While gateways connect InfiniBand to Ethernet, routers specifically connect different InfiniBand subnets using GIDs for addressing.\n\n#### **Question 4: Upper Layer Protocols**\n*What upper layer protocol should be used for parallel computing in HPC?*\n\n**Answer**: MPI (Message Passing Interface)\n\n**Explanation**: MPI is specifically designed for parallel computing and is the standard for HPC applications requiring coordination between multiple compute nodes.\n\n#### **Question 5: Network Troubleshooting**\n*A server shows \"State: Initializing\" and \"Base LID: 0\". What's the most likely cause?*\n\n**Answer**: No running Subnet Manager in the fabric\n\n**Explanation**: The port has physical connectivity but can't complete initialization because there's no SM available to assign a LID.\n\n### 🔍 Common Quiz Traps to Avoid\n\n**❌ Don't Confuse These**:\n- **Router vs Gateway**: Router connects IB subnets; Gateway connects IB to Ethernet\n- **4,000 vs 48,000**: Single subnet supports 48,000 nodes, not 4,000\n- **5 seconds vs 1 millisecond**: Traditional recovery is 5s; Self-Healing is 1ms\n- **Flow Control Layer**: It's Link Layer, NOT Network Layer\n- **LID Assignment**: Done by Subnet Manager, NOT switches or routers\n- **Adaptive Routing vs SHARP**: AR is for load balancing; SHARP is for collective operations\n\n**✅ Remember These Key Phrases**:\n- \"Hardware-based transport services implemented by HCAs\"\n- \"Credit-based flow control ensures lossless networking\"\n- \"Aggregating multiple physical lanes provides higher bandwidth\"\n- \"Subnet Manager assigns LIDs and enables 
plug-and-play\"\n- \"Physical layer ensures signal integrity\"\n\n### 📚 Additional Study Resources\n- Focus on understanding the technical concepts and their relationships\n- Practice drawing the five-layer architecture from memory\n- Study the bandwidth evolution timeline\n- Understand the difference between edge switches and director switches\n- Learn the key NVIDIA product families and their capabilities\n\n---\n\n## Quick Reference Cards\n\n### InfiniBand vs TCP/IP Comparison\n\n| Aspect | TCP/IP | InfiniBand |\n|--------|--------|------------|\n| Transport | Software-based | Hardware-based |\n| CPU Usage | High | Near-zero |\n| Latency | Milliseconds | Microseconds |\n| Packet Loss | Possible | Lossless |\n| Management | Distributed | Centralized (SM) |\n\n### Key Acronyms\n- **HCA**: Host Channel Adapter\n- **SM**: Subnet Manager\n- **LID**: Local ID\n- **GID**: Global ID\n- **RDMA**: Remote Direct Memory Access\n- **QoS**: Quality of Service\n- **SHARP**: Scalable Hierarchical Aggregation and Reduction Protocol\n- **DAC**: Direct Attach Copper\n- **AOC**: Active Optical Cable\n- **VPI**: Virtual Protocol Interconnect\n- **PSID**: Parameter Set ID\n\n---\n\n## Final Note\n\nThe key to mastering InfiniBand is understanding that it's designed for **speed, efficiency, and scale** - everything else follows from these core principles! Whether you're troubleshooting fabric issues, selecting the right routing engine, or optimizing for specific workloads, always consider how each decision impacts these fundamental goals.\n\nThe combination of hardware-based transport, lossless networking, and intelligent management makes InfiniBand the backbone of modern HPC, AI, and high-performance computing environments. 
Master these concepts, and you'll be well-equipped to design, deploy, and troubleshoot even the most complex InfiniBand fabrics.\n\n---\n\n## Contact\n\n**Scott Thornton** — AI Security Researcher\n\n- Website: [perfecxion.ai](https://perfecxion.ai/)\n- Email: [scott@perfecxion.ai](mailto:scott@perfecxion.ai)\n- LinkedIn: [linkedin.com/in/scthornton](https://www.linkedin.com/in/scthornton)\n- ORCID: [0009-0008-0491-0032](https://orcid.org/0009-0008-0491-0032)\n- GitHub: [@scthornton](https://github.com/scthornton)\n\n**Security Issues**: Please report via [SECURITY.md](SECURITY.md)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fscthornton%2Finfiniband","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fscthornton%2Finfiniband","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fscthornton%2Finfiniband/lists"}