{"id":30305314,"url":"https://github.com/nikhilrout/thegemmcoreproject","last_synced_at":"2025-08-17T08:08:37.148Z","repository":{"id":279486251,"uuid":"938967641","full_name":"NikhilRout/TheGEMMCoreProject","owner":"NikhilRout","description":"SystemVerilog Implementation of Nvidia's CUDA/Tensor Core GEMM Operations","archived":false,"fork":false,"pushed_at":"2025-08-14T05:24:50.000Z","size":18800,"stargazers_count":5,"open_issues_count":0,"forks_count":1,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-08-14T07:18:31.568Z","etag":null,"topics":["cuda","floating-point","gemm","gpgpu","hybrid-precision-training","sparse-matrix","systolic-array","tensorcore","tpu"],"latest_commit_sha":null,"homepage":"","language":"Verilog","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"bsd-3-clause","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/NikhilRout.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-02-25T19:29:00.000Z","updated_at":"2025-08-14T05:24:53.000Z","dependencies_parsed_at":"2025-08-14T07:29:06.291Z","dependency_job_id":null,"html_url":"https://github.com/NikhilRout/TheGEMMCoreProject","commit_stats":null,"previous_names":["nikhilrout/tensorcoreproject","nikhilrout/thetensorcoreproject","nikhilrout/thegemmcoreproject"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/NikhilRout/TheGEMMCoreProject","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NikhilRout%2FTheGEMMCoreProject","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NikhilRout%2FTheGEMMCoreProject/tags","releas
es_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NikhilRout%2FTheGEMMCoreProject/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NikhilRout%2FTheGEMMCoreProject/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/NikhilRout","download_url":"https://codeload.github.com/NikhilRout/TheGEMMCoreProject/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NikhilRout%2FTheGEMMCoreProject/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":270820793,"owners_count":24651534,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-08-17T02:00:09.016Z","response_time":129,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cuda","floating-point","gemm","gpgpu","hybrid-precision-training","sparse-matrix","systolic-array","tensorcore","tpu"],"created_at":"2025-08-17T08:08:36.558Z","updated_at":"2025-08-17T08:08:37.129Z","avatar_url":"https://github.com/NikhilRout.png","language":"Verilog","funding_links":[],"categories":[],"sub_categories":[],"readme":"# TheGEMMCoreProject\nSystemVerilog implementation of Nvidia's SIMT CUDA, Hybrid-Precision Tensor Core, and Google's Systolic Array TPU MXU GEMM Operations. 
\nThese modules do not emulate the actual microarchitecture that executes CUDA/Tensor Core instructions; they simply perform the same operations for direct use in FPGA designs.\n\nCheck out my work on the Vortex GPGPU's [Tensor Core Unit (TCU) extension's DRL Floating Point RTL backend](https://github.com/vortexgpgpu/vortex/tree/bug_fixes/hw/rtl/tcu) for a more optimized, realistic microarchitecture implementation.\n\n## Tensor Core Versions\n### TensorCore v0: Volta Architecture [FP16MUL FP32ADD]\n\u003cdiv align=\"center\"\u003e\n  \u003cimg src=\"./Arch%20Diags/VoltaTensorCore2.png\" alt=\"Volta Tensor Core Architecture Diagram\" width=\"600\"\u003e\n\u003c/div\u003e\n\u003cdiv align=\"center\"\u003e\n  \u003cimg src=\"./Arch%20Diags/VoltaTensorCore.png\" alt=\"Volta Tensor Core Architecture Diagram\" width=\"600\"\u003e\n\u003c/div\u003e\n\n### TensorCore v1: Ampere Architecture [TF32MUL FP32ADD / BF16MUL FP32ADD] + Fine-Grained Structured Sparsity\n\u003cdiv align=\"center\"\u003e\n  \u003cimg src=\"./Arch%20Diags/AmpereTensorCoreTF32.png\" alt=\"Ampere Tensor Core Architecture Diagram\" width=\"600\"\u003e\n\u003c/div\u003e\n\u003cdiv align=\"center\"\u003e\n  \u003cimg src=\"./Arch Diags/Fine-Grained Structured Sparsity.png\" alt=\"Ampere Fine-Grained Structured Sparsity Diagram\" width=\"600\"\u003e\n\u003c/div\u003e\n\n### TensorCore v2: Hopper Architecture [FP8(E5M2/E4M3)MUL FP16ADD]\n\u003cdiv align=\"center\"\u003e\n  \u003cimg src=\"./Arch Diags/FP8HopperTensorCore.png\" alt=\"Hopper Tensor Core Architecture Diagram\" width=\"600\"\u003e\n\u003c/div\u003e\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnikhilrout%2Fthegemmcoreproject","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fnikhilrout%2Fthegemmcoreproject","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnikhilrout%2Fthegemmcoreproject/lists"}