{"id":21986911,"url":"https://github.com/saforem2/llm-workshop-talk","last_synced_at":"2026-01-23T06:20:41.173Z","repository":{"id":222224732,"uuid":"756533852","full_name":"saforem2/llm-workshop-talk","owner":"saforem2","description":"Simple tutorial on creating Small(-ish) LLMs (pt. 2 🎉!!)","archived":false,"fork":false,"pushed_at":"2024-09-10T14:51:56.000Z","size":5288,"stargazers_count":3,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-12-01T04:43:44.948Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"https://saforem2.github.io/llm-workshop-talk/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/saforem2.png","metadata":{"files":{"readme":"docs/README 2.html","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2024-02-12T20:58:36.000Z","updated_at":"2024-12-13T07:00:33.000Z","dependencies_parsed_at":"2025-04-30T08:48:25.753Z","dependency_job_id":null,"html_url":"https://github.com/saforem2/llm-workshop-talk","commit_stats":null,"previous_names":["saforem2/llm-workshop-talk"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/saforem2/llm-workshop-talk","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/saforem2%2Fllm-workshop-talk","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/saforem2%2Fllm-workshop-talk/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/saforem2%2Fllm-workshop-talk/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/safor
em2%2Fllm-workshop-talk/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/saforem2","download_url":"https://codeload.github.com/saforem2/llm-workshop-talk/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/saforem2%2Fllm-workshop-talk/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28682259,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-23T05:48:07.525Z","status":"ssl_error","status_checked_at":"2026-01-23T05:48:07.129Z","response_time":59,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-29T18:22:38.075Z","updated_at":"2026-01-23T06:20:41.124Z","avatar_url":"https://github.com/saforem2.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003c!DOCTYPE html\u003e\n\u003chtml xmlns=\"http://www.w3.org/1999/xhtml\" lang=\"en\" xml:lang=\"en\"\u003e\u003chead\u003e\n\n\u003cmeta charset=\"utf-8\"\u003e\n\u003cmeta name=\"generator\" content=\"quarto-1.4.549\"\u003e\n\n\u003cmeta name=\"viewport\" content=\"width=device-width, initial-scale=1.0, user-scalable=yes\"\u003e\n\n\u003cmeta name=\"author\" content=\"Sam Foreman \"\u003e\n\u003cmeta name=\"dcterms.date\" content=\"2024-02-13\"\u003e\n\n\u003ctitle\u003eCreating Small(-ish) LLMs\u003c/title\u003e\n\u003cstyle\u003e\ncode{white-space: 
pre-wrap;}\nspan.smallcaps{font-variant: small-caps;}\ndiv.columns{display: flex; gap: min(4vw, 1.5em);}\ndiv.column{flex: auto; overflow-x: auto;}\ndiv.hanging-indent{margin-left: 1.5em; text-indent: -1.5em;}\nul.task-list{list-style: none;}\nul.task-list li input[type=\"checkbox\"] {\n  width: 0.8em;\n  margin: 0 0.8em 0.2em -1em; /* quarto-specific, see https://github.com/quarto-dev/quarto-cli/issues/4556 */ \n  vertical-align: middle;\n}\n/* CSS for syntax highlighting */\npre \u003e code.sourceCode { white-space: pre; position: relative; }\npre \u003e code.sourceCode \u003e span { line-height: 1.25; }\npre \u003e code.sourceCode \u003e span:empty { height: 1.2em; }\n.sourceCode { overflow: visible; }\ncode.sourceCode \u003e span { color: inherit; text-decoration: inherit; }\ndiv.sourceCode { margin: 1em 0; }\npre.sourceCode { margin: 0; }\n@media screen {\ndiv.sourceCode { overflow: auto; }\n}\n@media print {\npre \u003e code.sourceCode { white-space: pre-wrap; }\npre \u003e code.sourceCode \u003e span { text-indent: -5em; padding-left: 5em; }\n}\npre.numberSource code\n  { counter-reset: source-line 0; }\npre.numberSource code \u003e span\n  { position: relative; left: -4em; counter-increment: source-line; }\npre.numberSource code \u003e span \u003e a:first-child::before\n  { content: counter(source-line);\n    position: relative; left: -1em; text-align: right; vertical-align: baseline;\n    border: none; display: inline-block;\n    -webkit-touch-callout: none; -webkit-user-select: none;\n    -khtml-user-select: none; -moz-user-select: none;\n    -ms-user-select: none; user-select: none;\n    padding: 0 4px; width: 4em;\n  }\npre.numberSource { margin-left: 3em;  padding-left: 4px; }\ndiv.sourceCode\n  {   }\n@media screen {\npre \u003e code.sourceCode \u003e span \u003e a:first-child::before { text-decoration: underline; }\n}\n/* CSS for citations */\ndiv.csl-bib-body { }\ndiv.csl-entry {\n  clear: both;\n  margin-bottom: 0em;\n}\n.hanging-indent 
div.csl-entry {\n  margin-left:2em;\n  text-indent:-2em;\n}\ndiv.csl-left-margin {\n  min-width:2em;\n  float:left;\n}\ndiv.csl-right-inline {\n  margin-left:2em;\n  padding-left:1em;\n}\ndiv.csl-indent {\n  margin-left: 2em;\n}\u003c/style\u003e\n\n\n\u003cscript src=\"site_libs/quarto-nav/quarto-nav.js\"\u003e\u003c/script\u003e\n\u003cscript src=\"site_libs/quarto-nav/headroom.min.js\"\u003e\u003c/script\u003e\n\u003cscript src=\"site_libs/clipboard/clipboard.min.js\"\u003e\u003c/script\u003e\n\u003cscript src=\"site_libs/quarto-search/autocomplete.umd.js\"\u003e\u003c/script\u003e\n\u003cscript src=\"site_libs/quarto-search/fuse.min.js\"\u003e\u003c/script\u003e\n\u003cscript src=\"site_libs/quarto-search/quarto-search.js\"\u003e\u003c/script\u003e\n\u003cmeta name=\"quarto:offset\" content=\"./\"\u003e\n\u003clink href=\"././favicon.svg\" rel=\"icon\" type=\"image/svg+xml\"\u003e\n\u003cscript src=\"site_libs/quarto-html/quarto.js\"\u003e\u003c/script\u003e\n\u003cscript src=\"site_libs/quarto-html/popper.min.js\"\u003e\u003c/script\u003e\n\u003cscript src=\"site_libs/quarto-html/tippy.umd.min.js\"\u003e\u003c/script\u003e\n\u003cscript src=\"site_libs/quarto-html/anchor.min.js\"\u003e\u003c/script\u003e\n\u003clink href=\"site_libs/quarto-html/tippy.css\" rel=\"stylesheet\"\u003e\n\u003clink href=\"site_libs/quarto-html/quarto-syntax-highlighting.css\" rel=\"stylesheet\" class=\"quarto-color-scheme\" id=\"quarto-text-highlighting-styles\"\u003e\n\u003clink href=\"site_libs/quarto-html/quarto-syntax-highlighting-dark.css\" rel=\"prefetch\" class=\"quarto-color-scheme quarto-color-alternate\" id=\"quarto-text-highlighting-styles\"\u003e\n\u003cscript src=\"site_libs/bootstrap/bootstrap.min.js\"\u003e\u003c/script\u003e\n\u003clink href=\"site_libs/bootstrap/bootstrap-icons.css\" rel=\"stylesheet\"\u003e\n\u003clink href=\"site_libs/bootstrap/bootstrap.min.css\" rel=\"stylesheet\" class=\"quarto-color-scheme\" id=\"quarto-bootstrap\" 
data-mode=\"light\"\u003e\n\u003clink href=\"site_libs/bootstrap/bootstrap-dark.min.css\" rel=\"prefetch\" class=\"quarto-color-scheme quarto-color-alternate\" id=\"quarto-bootstrap\" data-mode=\"light\"\u003e\n\u003clink href=\"site_libs/quarto-contrib/fontawesome6-0.1.0/all.css\" rel=\"stylesheet\"\u003e\n\u003clink href=\"site_libs/quarto-contrib/fontawesome6-0.1.0/latex-fontsize.css\" rel=\"stylesheet\"\u003e\n\u003clink href=\"site_libs/quarto-contrib/academicons-1.9.2/all.css\" rel=\"stylesheet\"\u003e\n\u003clink href=\"site_libs/quarto-contrib/academicons-1.9.2/size.css\" rel=\"stylesheet\"\u003e\n\u003cscript id=\"quarto-search-options\" type=\"application/json\"\u003e{\n  \"location\": \"navbar\",\n  \"copy-button\": false,\n  \"collapse-after\": 3,\n  \"panel-placement\": \"end\",\n  \"type\": \"overlay\",\n  \"limit\": 50,\n  \"keyboard-shortcut\": [\n    \"f\",\n    \"/\",\n    \"s\"\n  ],\n  \"show-item-context\": false,\n  \"language\": {\n    \"search-no-results-text\": \"No results\",\n    \"search-matching-documents-text\": \"matching documents\",\n    \"search-copy-link-title\": \"Copy link to search\",\n    \"search-hide-matches-text\": \"Hide additional matches\",\n    \"search-more-match-text\": \"more match in this document\",\n    \"search-more-matches-text\": \"more matches in this document\",\n    \"search-clear-button-title\": \"Clear\",\n    \"search-text-placeholder\": \"\",\n    \"search-detached-cancel-button-title\": \"Cancel\",\n    \"search-submit-button-title\": \"Submit\",\n    \"search-label\": \"Search\"\n  }\n}\u003c/script\u003e\n\u003cscript async=\"\" src=\"https://www.googletagmanager.com/gtag/js?id=G-XVM2Y822Y1\"\u003e\u003c/script\u003e\n\n\u003cscript type=\"text/javascript\"\u003e\n\nwindow.dataLayer = window.dataLayer || [];\nfunction gtag(){dataLayer.push(arguments);}\ngtag('js', new Date());\ngtag('config', 'G-XVM2Y822Y1', { 'anonymize_ip': true});\n\u003c/script\u003e\n\n\n\u003clink rel=\"stylesheet\" 
href=\"css/default.css\"\u003e\n\u003clink rel=\"stylesheet\" href=\"css/callouts.css\"\u003e\n\u003cmeta property=\"og:title\" content=\"Creating Small(-ish) LLMs\"\u003e\n\u003cmeta property=\"og:description\" content=\"Creating Small(-ish) LLMs\"\u003e\n\u003cmeta property=\"og:site_name\" content=\"Creating Small(-ish) LLMs\"\u003e\n\u003cmeta name=\"twitter:title\" content=\"Creating Small(-ish) LLMs\"\u003e\n\u003cmeta name=\"twitter:description\" content=\"Creating Small(-ish) LLMs\"\u003e\n\u003cmeta name=\"twitter:image\" content=\"https://saforem2.github.io/LLM-tutorial/assets/thumbnail.png\"\u003e\n\u003cmeta name=\"twitter:creator\" content=\"@saforem2\"\u003e\n\u003cmeta name=\"twitter:site\" content=\"@saforem2\"\u003e\n\u003cmeta name=\"twitter:card\" content=\"summary_large_image\"\u003e\n\u003cmeta name=\"citation_title\" content=\"Creating Small(-ish) LLMs\"\u003e\n\u003cmeta name=\"citation_author\" content=\"Sam Foreman\"\u003e\n\u003cmeta name=\"citation_publication_date\" content=\"2024-02-13\"\u003e\n\u003cmeta name=\"citation_cover_date\" content=\"2024-02-13\"\u003e\n\u003cmeta name=\"citation_year\" content=\"2024\"\u003e\n\u003cmeta name=\"citation_online_date\" content=\"2024-02-13\"\u003e\n\u003cmeta name=\"citation_fulltext_html_url\" content=\"https://saforem2.github.io/LLM-tutorial\"\u003e\n\u003cmeta name=\"citation_language\" content=\"en\"\u003e\n\u003cmeta name=\"citation_reference\" content=\"citation_title=Progress on (g-2)_\\mu from lattice QCD;,citation_author=Hartmut Wittig;,citation_publication_date=2023;,citation_cover_date=2023;,citation_year=2023;,citation_fulltext_html_url=https://arxiv.org/abs/2306.04165;\"\u003e\n\u003cmeta name=\"citation_reference\" content=\"citation_title=Hybrid Monte Carlo;,citation_author=S. Duane;,citation_author=A. D. Kennedy;,citation_author=B. J. Pendleton;,citation_author=D. 
Roweth;,citation_publication_date=1987;,citation_cover_date=1987;,citation_year=1987;,citation_doi=10.1016/0370-2693(87)91197-X;,citation_volume=195;,citation_journal_title=Phys. Lett. B;\"\u003e\n\u003cmeta name=\"citation_reference\" content=\"citation_title=Snowmass 2021 Computational Frontier CompF03 Topical Group Report: Machine Learning;,citation_author=Phiala Shanahan;,citation_author=others;,citation_publication_date=2022-09;,citation_cover_date=2022-09;,citation_year=2022;,citation_fulltext_html_url=https://arxiv.org/abs/2209.07559;\"\u003e\n\u003cmeta name=\"citation_reference\" content=\"citation_title=Applications of Machine Learning to Lattice Quantum Field Theory;,citation_author=Denis Boyda;,citation_author=others;,citation_publication_date=2022-02;,citation_cover_date=2022-02;,citation_year=2022;,citation_fulltext_html_url=https://arxiv.org/abs/2202.05838;,citation_conference_title=Snowmass 2021;\"\u003e\n\u003cmeta name=\"citation_reference\" content=\"citation_title=LeapfrogLayers: A Trainable Framework for Effective Topological Sampling;,citation_author=Sam Foreman;,citation_author=Xiao-Yong Jin;,citation_author=James C. Osborn;,citation_publication_date=2022-05;,citation_cover_date=2022-05;,citation_year=2022;,citation_fulltext_html_url=https://arxiv.org/abs/2112.01582;,citation_doi=10.22323/1.396.0508;,citation_volume=LATTICE2021;,citation_journal_title=PoS;\"\u003e\n\u003cmeta name=\"citation_reference\" content=\"citation_title=HMC with Normalizing Flows;,citation_author=Sam Foreman;,citation_author=Taku Izubuchi;,citation_author=Luchang Jin;,citation_author=Xiao-Yong Jin;,citation_author=James C. 
Osborn;,citation_author=Akio Tomiya;,citation_publication_date=2022;,citation_cover_date=2022;,citation_year=2022;,citation_fulltext_html_url=https://arxiv.org/abs/2112.01586;,citation_doi=10.22323/1.396.0073;,citation_volume=LATTICE2021;,citation_journal_title=PoS;\"\u003e\n\u003cmeta name=\"citation_reference\" content=\"citation_title=Deep Learning Hamiltonian Monte Carlo;,citation_author=Sam Foreman;,citation_author=Xiao-Yong Jin;,citation_author=James C. Osborn;,citation_publication_date=2021-05;,citation_cover_date=2021-05;,citation_year=2021;,citation_fulltext_html_url=https://arxiv.org/abs/2105.03418;,citation_conference_title=9th International Conference on Learning Representations;\"\u003e\n\u003cmeta name=\"citation_reference\" content=\"citation_title=Mastering language models;,citation_author=Samuel Montgomery;,citation_publication_date=2023-10;,citation_cover_date=2023-10;,citation_year=2023;,citation_fulltext_html_url=https://towardsdatascience.com/mastering-language-models-32e1d891511a;,citation_journal_title=Medium;,citation_publisher=Towards Data Science;\"\u003e\n\u003cmeta name=\"citation_reference\" content=\"citation_title=Harnessing the power of LLMs in practice: A survey on ChatGPT and beyond;,citation_author=Jingfeng Yang;,citation_author=Hongye Jin;,citation_author=Ruixiang Tang;,citation_author=Xiaotian Han;,citation_author=Qizhang Feng;,citation_author=Haoming Jiang;,citation_author=Bing Yin;,citation_author=Xia Hu;,citation_publication_date=2023;,citation_cover_date=2023;,citation_year=2023;,citation_fulltext_html_url=https://arxiv.org/abs/2304.13712;\"\u003e\n\u003cmeta name=\"citation_reference\" content=\"citation_title=Training tips for the transformer model;,citation_author=Martin Popel;,citation_author=Ondřej 
Bojar;,citation_publication_date=2018-04;,citation_cover_date=2018-04;,citation_year=2018;,citation_fulltext_html_url=https://doi.org/10.2478%2Fpralin-2018-0002;,citation_issue=1;,citation_doi=10.2478/pralin-2018-0002;,citation_volume=110;,citation_journal_title=The Prague Bulletin of Mathematical Linguistics;,citation_publisher=Charles University in Prague, Karolinum Press;\"\u003e\n\u003cmeta name=\"citation_reference\" content=\"citation_title=Attention is all you need;,citation_author=Ashish Vaswani;,citation_author=Noam Shazeer;,citation_author=Niki Parmar;,citation_author=Jakob Uszkoreit;,citation_author=Llion Jones;,citation_author=Aidan N. Gomez;,citation_author=Lukasz Kaiser;,citation_author=Illia Polosukhin;,citation_publication_date=2017;,citation_cover_date=2017;,citation_year=2017;,citation_fulltext_html_url=https://arxiv.org/abs/1706.03762;\"\u003e\n\u003cmeta name=\"citation_reference\" content=\"citation_title=Tree of thoughts: Deliberate problem solving with large language models;,citation_author=Shunyu Yao;,citation_author=Dian Yu;,citation_author=Jeffrey Zhao;,citation_author=Izhak Shafran;,citation_author=Thomas L. Griffiths;,citation_author=Yuan Cao;,citation_author=Karthik Narasimhan;,citation_publication_date=2023;,citation_cover_date=2023;,citation_year=2023;,citation_fulltext_html_url=https://arxiv.org/abs/2305.10601;\"\u003e\n\u003cmeta name=\"citation_reference\" content=\"citation_title=GenSLMs: Genome-scale language models reveal SARS-CoV-2 evolutionary dynamics;,citation_abstract=We seek to transform how new and emergent variants of pandemiccausing viruses, specifically SARS-CoV-2, are identified and classified. By adapting large language models (LLMs) for genomic data, we build genome-scale language models (GenSLMs) which can learn the evolutionary landscape of SARS-CoV-2 genomes. 
By pretraining on over 110 million prokaryotic gene sequences and finetuning a SARS-CoV-2-specific model on 1.5 million genomes, we show that GenSLMs can accurately and rapidly identify variants of concern. Thus, to our knowledge, GenSLMs represents one of the first whole genome scale foundation models which can generalize to other prediction tasks. We demonstrate scaling of GenSLMs on GPU-based supercomputers and AI-hardware accelerators utilizing 1.63 Zettaflops in training runs with a sustained performance of 121 PFLOPS in mixed precision and peak of 850 PFLOPS. We present initial scientific insights from examining GenSLMs in tracking evolutionary dynamics of SARS-CoV-2, paving the path to realizing this on large biological data.Competing Interest StatementThe authors have declared no competing interest.;,citation_author=Maxim Zvyagin;,citation_author=Alexander Brace;,citation_author=Kyle Hippe;,citation_author=Yuntian Deng;,citation_author=Bin Zhang;,citation_author=Cindy Orozco Bohorquez;,citation_author=Austin Clyde;,citation_author=Bharat Kale;,citation_author=Danilo Perez-Rivera;,citation_author=Heng Ma;,citation_author=Carla M. Mann;,citation_author=Michael Irvin;,citation_author=J. Gregory Pauloski;,citation_author=Logan Ward;,citation_author=Valerie Hayot-Sasson;,citation_author=Murali Emani;,citation_author=Sam Foreman;,citation_author=Zhen Xie;,citation_author=Diangen Lin;,citation_author=Maulik Shukla;,citation_author=Weili Nie;,citation_author=Josh Romero;,citation_author=Christian Dallago;,citation_author=Arash Vahdat;,citation_author=Chaowei Xiao;,citation_author=Thomas Gibbs;,citation_author=Ian Foster;,citation_author=James J. Davis;,citation_author=Michael E. 
Papka;,citation_author=Thomas Brettin;,citation_author=Rick Stevens;,citation_author=Anima Anandkumar;,citation_author=Venkatram Vishwanath;,citation_author=Arvind Ramanathan;,citation_publication_date=2022;,citation_cover_date=2022;,citation_year=2022;,citation_fulltext_html_url=https://www.biorxiv.org/content/early/2022/11/23/2022.10.10.511571;,citation_doi=10.1101/2022.10.10.511571;,citation_journal_title=bioRxiv;,citation_publisher=Cold Spring Harbor Laboratory;\"\u003e\n\u003c/head\u003e\n\n\u003cbody class=\"nav-fixed\"\u003e\n\n\u003cdiv id=\"quarto-search-results\"\u003e\u003c/div\u003e\n  \u003cheader id=\"quarto-header\" class=\"headroom fixed-top\"\u003e\n    \u003cnav class=\"navbar navbar-expand-lg \" data-bs-theme=\"dark\"\u003e\n      \u003cdiv class=\"navbar-container container-fluid\"\u003e\n      \u003cdiv class=\"navbar-brand-container mx-auto\"\u003e\n    \u003ca href=\"./index.html\" class=\"navbar-brand navbar-brand-logo\"\u003e\n    \u003cimg src=\"././favicon.svg\" alt=\"\" class=\"navbar-logo\"\u003e\n    \u003c/a\u003e\n    \u003ca class=\"navbar-brand\" href=\"./index.html\"\u003e\n    \u003cspan class=\"navbar-title\"\u003eCreating Small(-ish) LLMs\u003c/span\u003e\n    \u003c/a\u003e\n  \u003c/div\u003e\n            \u003cdiv id=\"quarto-search\" class=\"\" title=\"Search\"\u003e\u003c/div\u003e\n          \u003cbutton class=\"navbar-toggler\" type=\"button\" data-bs-toggle=\"collapse\" data-bs-target=\"#navbarCollapse\" aria-controls=\"navbarCollapse\" aria-expanded=\"false\" aria-label=\"Toggle navigation\" onclick=\"if (window.quartoToggleHeadroom) { window.quartoToggleHeadroom(); }\"\u003e\n  \u003cspan class=\"navbar-toggler-icon\"\u003e\u003c/span\u003e\n\u003c/button\u003e\n          \u003cdiv class=\"collapse navbar-collapse\" id=\"navbarCollapse\"\u003e\n            \u003cul class=\"navbar-nav navbar-nav-scroll ms-auto\"\u003e\n  \u003cli class=\"nav-item compact\"\u003e\n    \u003ca class=\"nav-link\" 
href=\"https://github.com/saforem2/LLM-tutorial\"\u003e \u003ci class=\"bi bi-github\" role=\"img\" aria-label=\"GitHub\"\u003e\n\u003c/i\u003e \n\u003cspan class=\"menu-text\"\u003e\u003c/span\u003e\u003c/a\u003e\n  \u003c/li\u003e  \n\u003c/ul\u003e\n          \u003c/div\u003e \u003c!-- /navcollapse --\u003e\n          \u003cdiv class=\"quarto-navbar-tools\"\u003e\n  \u003ca href=\"\" class=\"quarto-color-scheme-toggle quarto-navigation-tool  px-1\" onclick=\"window.quartoToggleColorScheme(); return false;\" title=\"Toggle dark mode\"\u003e\u003ci class=\"bi\"\u003e\u003c/i\u003e\u003c/a\u003e\n\u003c/div\u003e\n      \u003c/div\u003e \u003c!-- /container-fluid --\u003e\n    \u003c/nav\u003e\n\u003c/header\u003e\n\u003c!-- content --\u003e\n\u003cdiv id=\"quarto-content\" class=\"quarto-container page-columns page-rows-contents page-layout-article page-navbar\"\u003e\n\u003c!-- sidebar --\u003e\n\u003c!-- margin-sidebar --\u003e\n    \u003cdiv id=\"quarto-margin-sidebar\" class=\"sidebar margin-sidebar\"\u003e\n        \u003cnav id=\"TOC\" role=\"doc-toc\" class=\"toc-active\"\u003e\n    \u003ch2 id=\"toc-title\"\u003eOn this page\u003c/h2\u003e\n   \n  \u003cul\u003e\n  \u003cli\u003e\u003ca href=\"#creating-small-ish-llms\" id=\"toc-creating-small-ish-llms\" class=\"nav-link active\" data-scroll-target=\"#creating-small-ish-llms\"\u003eCreating Small(-ish) LLMs\u003c/a\u003e\u003c/li\u003e\n  \u003cli\u003e\u003ca href=\"#llms-from-scratch\" id=\"toc-llms-from-scratch\" class=\"nav-link\" data-scroll-target=\"#llms-from-scratch\"\u003eLLMs from Scratch\u003c/a\u003e\u003c/li\u003e\n  \u003cli\u003e\u003ca href=\"#emergent-abilities\" id=\"toc-emergent-abilities\" class=\"nav-link\" data-scroll-target=\"#emergent-abilities\"\u003eEmergent Abilities\u003c/a\u003e\u003c/li\u003e\n  \u003cli\u003e\u003ca href=\"#training-llms\" id=\"toc-training-llms\" class=\"nav-link\" data-scroll-target=\"#training-llms\"\u003eTraining LLMs\u003c/a\u003e\u003c/li\u003e\n  
\u003cli\u003e\u003ca href=\"#life-cycle-of-the-llm\" id=\"toc-life-cycle-of-the-llm\" class=\"nav-link\" data-scroll-target=\"#life-cycle-of-the-llm\"\u003eLife-Cycle of the LLM\u003c/a\u003e\u003c/li\u003e\n  \u003cli\u003e\u003ca href=\"#forward-pass\" id=\"toc-forward-pass\" class=\"nav-link\" data-scroll-target=\"#forward-pass\"\u003eForward Pass\u003c/a\u003e\u003c/li\u003e\n  \u003cli\u003e\u003ca href=\"#generating-text\" id=\"toc-generating-text\" class=\"nav-link\" data-scroll-target=\"#generating-text\"\u003eGenerating Text\u003c/a\u003e\u003c/li\u003e\n  \u003cli\u003e\u003ca href=\"#life-cycle-of-the-llm-pre-training\" id=\"toc-life-cycle-of-the-llm-pre-training\" class=\"nav-link\" data-scroll-target=\"#life-cycle-of-the-llm-pre-training\"\u003eLife-Cycle of the LLM: Pre-training\u003c/a\u003e\u003c/li\u003e\n  \u003cli\u003e\u003ca href=\"#life-cycle-of-the-llm-fine-tuning\" id=\"toc-life-cycle-of-the-llm-fine-tuning\" class=\"nav-link\" data-scroll-target=\"#life-cycle-of-the-llm-fine-tuning\"\u003eLife-Cycle of the LLM: Fine-Tuning\u003c/a\u003e\u003c/li\u003e\n  \u003cli\u003e\u003ca href=\"#assistant-models\" id=\"toc-assistant-models\" class=\"nav-link\" data-scroll-target=\"#assistant-models\"\u003eAssistant Models\u003c/a\u003e\u003c/li\u003e\n  \u003cli\u003e\u003ca href=\"#saforem2wordplay\" id=\"toc-saforem2wordplay\" class=\"nav-link\" data-scroll-target=\"#saforem2wordplay\"\u003e\u003ccode\u003esaforem2/wordplay\u003c/code\u003e 🎮💬\u003c/a\u003e\u003c/li\u003e\n  \u003cli\u003e\u003ca href=\"#saforem2wordplay-1\" id=\"toc-saforem2wordplay-1\" class=\"nav-link\" data-scroll-target=\"#saforem2wordplay-1\"\u003e\u003ccode\u003esaforem2/wordplay\u003c/code\u003e 🎮💬\u003c/a\u003e\u003c/li\u003e\n  \u003cli\u003e\u003ca href=\"#install\" id=\"toc-install\" class=\"nav-link\" data-scroll-target=\"#install\"\u003eInstall\u003c/a\u003e\u003c/li\u003e\n  \u003cli\u003e\u003ca href=\"#dependencies\" id=\"toc-dependencies\" class=\"nav-link\" 
data-scroll-target=\"#dependencies\"\u003eDependencies\u003c/a\u003e\u003c/li\u003e\n  \u003cli\u003e\u003ca href=\"#quick-start\" id=\"toc-quick-start\" class=\"nav-link\" data-scroll-target=\"#quick-start\"\u003eQuick Start\u003c/a\u003e\u003c/li\u003e\n  \u003cli\u003e\u003ca href=\"#model-model.py\" id=\"toc-model-model.py\" class=\"nav-link\" data-scroll-target=\"#model-model.py\"\u003eModel \u003ccode\u003emodel.py\u003c/code\u003e\u003c/a\u003e\u003c/li\u003e\n  \u003cli\u003e\u003ca href=\"#trainer-trainer.py\" id=\"toc-trainer-trainer.py\" class=\"nav-link\" data-scroll-target=\"#trainer-trainer.py\"\u003eTrainer \u003ccode\u003etrainer.py\u003c/code\u003e\u003c/a\u003e\u003c/li\u003e\n  \u003cli\u003e\u003ca href=\"#hands-on-tutorial\" id=\"toc-hands-on-tutorial\" class=\"nav-link\" data-scroll-target=\"#hands-on-tutorial\"\u003eHands-on Tutorial\u003c/a\u003e\u003c/li\u003e\n  \u003cli\u003e\u003ca href=\"#section\" id=\"toc-section\" class=\"nav-link\" data-scroll-target=\"#section\"\u003e\u003c/a\u003e\u003c/li\u003e\n  \u003cli\u003e\u003ca href=\"#links\" id=\"toc-links\" class=\"nav-link\" data-scroll-target=\"#links\"\u003eLinks\u003c/a\u003e\u003c/li\u003e\n  \u003cli\u003e\u003ca href=\"#references\" id=\"toc-references\" class=\"nav-link\" data-scroll-target=\"#references\"\u003eReferences\u003c/a\u003e\u003c/li\u003e\n  \u003c/ul\u003e\n\u003cdiv class=\"toc-actions\"\u003e\u003cul\u003e\u003cli\u003e\u003ca href=\"https://github.com/saforem2/LLM-tutorial/blob/main/README 2.md\" class=\"toc-action\"\u003e\u003ci class=\"bi bi-github\"\u003e\u003c/i\u003eView source\u003c/a\u003e\u003c/li\u003e\u003cli\u003e\u003ca href=\"https://github.com/saforem2/LLM-tutorial/edit/main/README 2.md\" class=\"toc-action\"\u003e\u003ci class=\"bi empty\"\u003e\u003c/i\u003eEdit this page\u003c/a\u003e\u003c/li\u003e\u003cli\u003e\u003ca href=\"https://github.com/saforem2/LLM-tutorial/issues/new\" class=\"toc-action\"\u003e\u003ci class=\"bi 
empty\"\u003e\u003c/i\u003eReport an issue\u003c/a\u003e\u003c/li\u003e\u003c/ul\u003e\u003c/div\u003e\u003c/nav\u003e\n    \u003c/div\u003e\n\u003c!-- main --\u003e\n\u003cmain class=\"content\" id=\"quarto-document-content\"\u003e\n\n\u003cheader id=\"title-block-header\" class=\"quarto-title-block default\"\u003e\n\u003cdiv class=\"quarto-title\"\u003e\n\u003cdiv class=\"quarto-title-block\"\u003e\u003cdiv\u003e\u003ch1 class=\"title\"\u003eCreating Small(-ish) LLMs\u003c/h1\u003e\u003cbutton type=\"button\" class=\"btn code-tools-button\" id=\"quarto-code-tools-source\" data-quarto-source-url=\"https://github.com/saforem2/LLM-tutorial/blob/main/README 2.md\"\u003e\u003ci class=\"bi\"\u003e\u003c/i\u003e\u003c/button\u003e\u003c/div\u003e\u003c/div\u003e\n\u003c/div\u003e\n\n\n\u003cdiv class=\"quarto-title-meta-author\"\u003e\n  \u003cdiv class=\"quarto-title-meta-heading\"\u003e\u003c/div\u003e\n  \u003cdiv class=\"quarto-title-meta-heading\"\u003e\u003c/div\u003e\n  \n    \u003cdiv class=\"quarto-title-meta-contents\"\u003e\n    \u003cp class=\"author\"\u003e\u003ca href=\"https://samforeman.me\"\u003eSam Foreman \u003c/a\u003e\u003ca href=\"https://orcid.org/0000-0002-9981-0876\"\u003e\u003cspan class=\"orcid-green\"\u003e\u003ci class=\"ai  ai-orcid\"\u003e\u003c/i\u003e\u003c/span\u003e\u003c/a\u003e \u003ca href=\"mailto:foremans@anl.gov\" class=\"quarto-title-author-email\"\u003e\u003ci class=\"bi bi-envelope\"\u003e\u003c/i\u003e\u003c/a\u003e \u003c/p\u003e\n  \u003c/div\u003e\n  \u003cdiv class=\"quarto-title-meta-contents\"\u003e\n        \u003cp class=\"affiliation\"\u003e\n            \u003ca href=\"https://alcf.anl.gov/about/people/sam-foreman\"\u003e\n            Argonne National Laboratory\n            \u003c/a\u003e\n          \u003c/p\u003e\n      \u003c/div\u003e\n  \u003c/div\u003e\n\n\u003cdiv class=\"quarto-title-meta\"\u003e\n\n      \n    \u003cdiv\u003e\n    \u003cdiv class=\"quarto-title-meta-heading\"\u003e\u003c/div\u003e\n    
\u003cdiv class=\"quarto-title-meta-contents\"\u003e\n      \u003cp class=\"date\"\u003eFebruary 13, 2024\u003c/p\u003e\n    \u003c/div\u003e\n  \u003c/div\u003e\n  \n    \u003cdiv\u003e\n    \u003cdiv class=\"quarto-title-meta-heading\"\u003e\u003c/div\u003e\n    \u003cdiv class=\"quarto-title-meta-contents\"\u003e\n      \u003cp class=\"date-modified\"\u003eFebruary 13, 2024\u003c/p\u003e\n    \u003c/div\u003e\n  \u003c/div\u003e\n    \n  \u003c/div\u003e\n  \n\n\n\u003c/header\u003e\n\n\n\u003csection id=\"creating-small-ish-llms\" class=\"level1\"\u003e\n\u003ch1\u003eCreating Small(-ish) LLMs\u003c/h1\u003e\n\u003cp\u003eSam Foreman \u003ca href=\"https://orcid.org/0000-0002-9981-0876\"\u003e\u003cspan class=\"orcid-green\"\u003e\u003c/span\u003e\u003c/a\u003e 2024-02-13\u003c/p\u003e\n\u003c/section\u003e\n\u003csection id=\"llms-from-scratch\" class=\"level1\"\u003e\n\u003ch1\u003eLLMs from Scratch\u003c/h1\u003e\n\u003cdiv\u003e\n\n\u003c/div\u003e\n\u003c/section\u003e\n\u003csection id=\"emergent-abilities\" class=\"level1\"\u003e\n\u003ch1\u003eEmergent Abilities\u003c/h1\u003e\n\u003cdiv width=\"66%\" style=\"text-align: center;\"\u003e\n\u003cp\u003e\u003cimg src=\"https://github.com/saforem2/llm-lunch-talk/blob/main/docs/assets/emergent-abilities.gif?raw=true\" height=\"75%\"\u003e\u003c/p\u003e\n\u003cp\u003e\u003ca href=\"https://arxiv.org/abs/2206.07682\"\u003eEmergent abilities of Large Language Models\u003c/a\u003e Yao et al.\u0026nbsp;(2023)\u003c/p\u003e\n\u003c/div\u003e\n\u003c/section\u003e\n\u003csection id=\"training-llms\" class=\"level1\"\u003e\n\u003ch1\u003eTraining LLMs\u003c/h1\u003e\n\u003cdiv\u003e\n\n\u003c/div\u003e\n\u003c/section\u003e\n\u003csection id=\"life-cycle-of-the-llm\" class=\"level1\"\u003e\n\u003ch1\u003eLife-Cycle of the LLM\u003c/h1\u003e\n\u003cdiv\u003e\n\n\u003c/div\u003e\n\u003c/section\u003e\n\u003csection id=\"forward-pass\" class=\"level1\"\u003e\n\u003ch1\u003eForward Pass\u003c/h1\u003e\n\u003cvideo 
data-autoplay=\"\" src=\"https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/assisted-generation/gif_1_1080p.mov\"\u003e\n\u003c/video\u003e\n\u003c/section\u003e\n\u003csection id=\"generating-text\" class=\"level1\"\u003e\n\u003ch1\u003eGenerating Text\u003c/h1\u003e\n\u003cvideo data-autoplay=\"\" src=\"https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/assisted-generation/gif_2_1080p.mov\"\u003e\n\u003c/video\u003e\n\u003c/section\u003e\n\u003csection id=\"life-cycle-of-the-llm-pre-training\" class=\"level1\"\u003e\n\u003ch1\u003eLife-Cycle of the LLM: Pre-training\u003c/h1\u003e\n\u003cp\u003e\u003cimg src=\"https://jalammar.github.io/images/gpt3/03-gpt3-training-step-back-prop.gif\" class=\"img-fluid\"\u003e\u003c/p\u003e\n\u003c/section\u003e\n\u003csection id=\"life-cycle-of-the-llm-fine-tuning\" class=\"level1\"\u003e\n\u003ch1\u003eLife-Cycle of the LLM: Fine-Tuning\u003c/h1\u003e\n\u003cp\u003e\u003cimg src=\"https://jalammar.github.io/images/gpt3/10-gpt3-fine-tuning.gif\" class=\"img-fluid\"\u003e\u003c/p\u003e\n\u003c/section\u003e\n\u003csection id=\"assistant-models\" class=\"level1\"\u003e\n\u003ch1\u003eAssistant Models\u003c/h1\u003e\n\u003cp\u003e\u003cspan class=\"preview-image\" style=\"text-align:center; margin-left:auto; margin-right: auto;\"\u003e\u003cimg src=\"https://github.com/saforem2/LLM-tutorial/blob/main/docs/assets/jailbreak.jpeg?raw=true\" class=\"img-fluid\"\u003e\u003c/span\u003e\u003c/p\u003e\n\u003c/section\u003e\n\u003csection id=\"saforem2wordplay\" class=\"level1\"\u003e\n\u003ch1\u003e\u003ca href=\"https://github.com/saforem2/wordplay\"\u003e\u003ccode\u003esaforem2/wordplay\u003c/code\u003e 🎮💬\u003c/a\u003e\u003c/h1\u003e\n\u003c!-- - [ `saforem2/wordplay`](https://github.com/saforem2/wordplay) --\u003e\n\u003cul\u003e\n\u003cli\u003eFork of Andrej Karpathy’s \u003ccode\u003enanoGPT\u003c/code\u003e\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003e\u003cimg 
src=\"https://github.com/saforem2/nanoGPT/raw/master/assets/nanogpt.jpg\" class=\"img-fluid\"\u003e\u003c/p\u003e\n\u003c/section\u003e\n\u003csection id=\"saforem2wordplay-1\" class=\"level1\"\u003e\n\u003ch1\u003e\u003ca href=\"https://github.com/saforem2/wordplay\"\u003e\u003ccode\u003esaforem2/wordplay\u003c/code\u003e 🎮💬\u003c/a\u003e\u003c/h1\u003e\n\u003cp\u003e\u003cimg src=\"https://github.com/saforem2/wordplay/blob/main/assets/car.png?raw=true\" data-ref-parent=\"fig-compare\" width=\"256\"\u003e\u003c/p\u003e\n\u003cp\u003e\u003cimg src=\"https://github.com/saforem2/wordplay/blob/main/assets/robot.png?raw=true\" data-ref-parent=\"fig-compare\" width=\"150\"\u003e\u003c/p\u003e\n\u003c/section\u003e\n\u003csection id=\"install\" class=\"level1\"\u003e\n\u003ch1\u003eInstall\u003c/h1\u003e\n\u003cdiv class=\"sourceCode\" id=\"cb1\"\u003e\u003cpre class=\"sourceCode bash\"\u003e\u003ccode class=\"sourceCode bash\"\u003e\u003cspan id=\"cb1-1\"\u003e\u003ca href=\"#cb1-1\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e\u003cspan class=\"ex\"\u003epython3\u003c/span\u003e \u003cspan class=\"at\"\u003e-m\u003c/span\u003e pip install \u003cspan class=\"st\"\u003e\"git+https://github.com/saforem2/wordplay.git\"\u003c/span\u003e\u003c/span\u003e\n\u003cspan id=\"cb1-2\"\u003e\u003ca href=\"#cb1-2\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e\u003cspan class=\"ex\"\u003epython3\u003c/span\u003e \u003cspan class=\"at\"\u003e-c\u003c/span\u003e \u003cspan class=\"st\"\u003e'import wordplay; print(wordplay.__file__)'\u003c/span\u003e\u003c/span\u003e\n\u003cspan id=\"cb1-3\"\u003e\u003ca href=\"#cb1-3\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e\u003cspan class=\"co\"\u003e# ./wordplay/src/wordplay/__init__.py\u003c/span\u003e\u003c/span\u003e\u003c/code\u003e\u003c/pre\u003e\u003c/div\u003e\n\u003c/section\u003e\n\u003csection id=\"dependencies\" 
class=\"level1\"\u003e\n\u003ch1\u003eDependencies\u003c/h1\u003e\n\u003cul\u003e\n\u003cli\u003e\u003ca href=\"https://github.com/huggingface/transformers\"\u003e\u003ccode\u003etransformers\u003c/code\u003e\u003c/a\u003e for Hugging Face Transformers (to load \u003ccode\u003eGPT-2\u003c/code\u003e checkpoints)\u003c/li\u003e\n\u003cli\u003e\u003ca href=\"https://github.com/huggingface/datasets\"\u003e\u003ccode\u003edatasets\u003c/code\u003e\u003c/a\u003e for Hugging Face Datasets (if you want to use OpenWebText)\u003c/li\u003e\n\u003cli\u003e\u003ca href=\"https://github.com/openai/tiktoken\"\u003e\u003ccode\u003etiktoken\u003c/code\u003e\u003c/a\u003e for OpenAI’s fast BPE code\u003c/li\u003e\n\u003cli\u003e\u003ca href=\"https://wandb.ai\"\u003e\u003ccode\u003ewandb\u003c/code\u003e\u003c/a\u003e for optional logging\u003c/li\u003e\n\u003cli\u003e\u003ca href=\"https://github.com/tqdm/tqdm\"\u003e\u003ccode\u003etqdm\u003c/code\u003e\u003c/a\u003e for progress bars\u003c/li\u003e\n\u003c/ul\u003e\n\u003c/section\u003e\n\u003csection id=\"quick-start\" class=\"level1\"\u003e\n\u003ch1\u003eQuick Start\u003c/h1\u003e\n\u003cul\u003e\n\u003cli\u003e\u003cp\u003eWe start by training a character-level GPT on the works of Shakespeare. The preparation script below performs two steps:\u003c/p\u003e\n\u003col type=\"1\"\u003e\n\u003cli\u003eDownload the data file (~1 MB)\u003c/li\u003e\n\u003cli\u003eConvert the raw text into one large stream of integers\u003c/li\u003e\n\u003c/ol\u003e\n\u003cdiv class=\"sourceCode\" id=\"cb2\"\u003e\u003cpre class=\"sourceCode bash\"\u003e\u003ccode class=\"sourceCode bash\"\u003e\u003cspan id=\"cb2-1\"\u003e\u003ca href=\"#cb2-1\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e\u003cspan class=\"ex\"\u003epython3\u003c/span\u003e data/shakespeare_char/prepare.py\u003c/span\u003e\u003c/code\u003e\u003c/pre\u003e\u003c/div\u003e\n\u003cp\u003eThis will create \u003ccode\u003edata/shakespeare_char/{train.bin, 
val.bin}\u003c/code\u003e.\u003c/p\u003e\u003c/li\u003e\n\u003c/ul\u003e\n\u003c/section\u003e\n\u003csection id=\"model-model.py\" class=\"level1\"\u003e\n\u003ch1\u003eModel \u003ca href=\"https://github.com/saforem2/wordplay/blob/master/src/wordplay/model.py\"\u003e\u003ccode\u003emodel.py\u003c/code\u003e\u003c/a\u003e\u003c/h1\u003e\n\u003cdiv class=\"tabset-margin-container\"\u003e\u003c/div\u003e\u003cdiv class=\"panel-tabset\" style=\"font-size: 0.75em; width: 100%!important; height: 100%!important;\"\u003e\n\u003cul class=\"nav nav-tabs\" role=\"tablist\"\u003e\u003cli class=\"nav-item\" role=\"presentation\"\u003e\u003ca class=\"nav-link active\" id=\"tabset-1-1-tab\" data-bs-toggle=\"tab\" data-bs-target=\"#tabset-1-1\" role=\"tab\" aria-controls=\"tabset-1-1\" aria-selected=\"true\"\u003e\u003ccode\u003eCausalSelfAttention\u003c/code\u003e\u003c/a\u003e\u003c/li\u003e\u003cli class=\"nav-item\" role=\"presentation\"\u003e\u003ca class=\"nav-link\" id=\"tabset-1-2-tab\" data-bs-toggle=\"tab\" data-bs-target=\"#tabset-1-2\" role=\"tab\" aria-controls=\"tabset-1-2\" aria-selected=\"false\"\u003e\u003ccode\u003eLayerNorm\u003c/code\u003e\u003c/a\u003e\u003c/li\u003e\u003cli class=\"nav-item\" role=\"presentation\"\u003e\u003ca class=\"nav-link\" id=\"tabset-1-3-tab\" data-bs-toggle=\"tab\" data-bs-target=\"#tabset-1-3\" role=\"tab\" aria-controls=\"tabset-1-3\" aria-selected=\"false\"\u003e\u003ccode\u003eMLP\u003c/code\u003e\u003c/a\u003e\u003c/li\u003e\u003cli class=\"nav-item\" role=\"presentation\"\u003e\u003ca class=\"nav-link\" id=\"tabset-1-4-tab\" data-bs-toggle=\"tab\" data-bs-target=\"#tabset-1-4\" role=\"tab\" aria-controls=\"tabset-1-4\" aria-selected=\"false\"\u003e\u003ccode\u003eBlock\u003c/code\u003e\u003c/a\u003e\u003c/li\u003e\u003cli class=\"nav-item\" role=\"presentation\"\u003e\u003ca class=\"nav-link\" id=\"tabset-1-5-tab\" data-bs-toggle=\"tab\" data-bs-target=\"#tabset-1-5\" role=\"tab\" aria-controls=\"tabset-1-5\" 
aria-selected=\"false\"\u003e\u003ccode\u003eGPT\u003c/code\u003e\u003c/a\u003e\u003c/li\u003e\u003c/ul\u003e\n\u003cdiv class=\"tab-content\" style=\"font-size: 0.75em; width: 100%!important; height: 100%!important;\"\u003e\n\u003cdiv id=\"tabset-1-1\" class=\"tab-pane active\" role=\"tabpanel\" aria-labelledby=\"tabset-1-1-tab\"\u003e\n\u003cdiv class=\"sourceCode\" id=\"cb3\"\u003e\u003cpre class=\"sourceCode python\"\u003e\u003ccode class=\"sourceCode python\"\u003e\u003cspan id=\"cb3-1\"\u003e\u003ca href=\"#cb3-1\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e\u003c/span\u003e\n\u003cspan id=\"cb3-2\"\u003e\u003ca href=\"#cb3-2\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e\u003cspan class=\"kw\"\u003eclass\u003c/span\u003e CausalSelfAttention(nn.Module):\u003c/span\u003e\n\u003cspan id=\"cb3-3\"\u003e\u003ca href=\"#cb3-3\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e    \u003cspan class=\"kw\"\u003edef\u003c/span\u003e \u003cspan class=\"fu\"\u003e__init__\u003c/span\u003e(\u003cspan class=\"va\"\u003eself\u003c/span\u003e, config: GPTModelConfig):\u003c/span\u003e\n\u003cspan id=\"cb3-4\"\u003e\u003ca href=\"#cb3-4\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        \u003cspan class=\"bu\"\u003esuper\u003c/span\u003e().\u003cspan class=\"fu\"\u003e__init__\u003c/span\u003e()\u003c/span\u003e\n\u003cspan id=\"cb3-5\"\u003e\u003ca href=\"#cb3-5\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        \u003cspan class=\"cf\"\u003eassert\u003c/span\u003e config.n_embd \u003cspan class=\"op\"\u003e%\u003c/span\u003e config.n_head \u003cspan class=\"op\"\u003e==\u003c/span\u003e \u003cspan class=\"dv\"\u003e0\u003c/span\u003e\u003c/span\u003e\n\u003cspan id=\"cb3-6\"\u003e\u003ca href=\"#cb3-6\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        \u003cspan class=\"co\"\u003e# key, query, value projections for all heads, but in a batch\u003c/span\u003e\u003c/span\u003e\n\u003cspan 
id=\"cb3-7\"\u003e\u003ca href=\"#cb3-7\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        \u003cspan class=\"va\"\u003eself\u003c/span\u003e.c_attn \u003cspan class=\"op\"\u003e=\u003c/span\u003e nn.Linear(\u003c/span\u003e\n\u003cspan id=\"cb3-8\"\u003e\u003ca href=\"#cb3-8\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e            config.n_embd,\u003c/span\u003e\n\u003cspan id=\"cb3-9\"\u003e\u003ca href=\"#cb3-9\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e            \u003cspan class=\"dv\"\u003e3\u003c/span\u003e \u003cspan class=\"op\"\u003e*\u003c/span\u003e config.n_embd,\u003c/span\u003e\n\u003cspan id=\"cb3-10\"\u003e\u003ca href=\"#cb3-10\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e            bias\u003cspan class=\"op\"\u003e=\u003c/span\u003econfig.bias\u003c/span\u003e\n\u003cspan id=\"cb3-11\"\u003e\u003ca href=\"#cb3-11\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        )\u003c/span\u003e\n\u003cspan id=\"cb3-12\"\u003e\u003ca href=\"#cb3-12\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        \u003cspan class=\"co\"\u003e# output projection\u003c/span\u003e\u003c/span\u003e\n\u003cspan id=\"cb3-13\"\u003e\u003ca href=\"#cb3-13\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        \u003cspan class=\"va\"\u003eself\u003c/span\u003e.c_proj \u003cspan class=\"op\"\u003e=\u003c/span\u003e nn.Linear(\u003c/span\u003e\n\u003cspan id=\"cb3-14\"\u003e\u003ca href=\"#cb3-14\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e            config.n_embd,\u003c/span\u003e\n\u003cspan id=\"cb3-15\"\u003e\u003ca href=\"#cb3-15\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e            config.n_embd,\u003c/span\u003e\n\u003cspan id=\"cb3-16\"\u003e\u003ca href=\"#cb3-16\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e            bias\u003cspan class=\"op\"\u003e=\u003c/span\u003econfig.bias\u003c/span\u003e\n\u003cspan id=\"cb3-17\"\u003e\u003ca 
href=\"#cb3-17\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        )\u003c/span\u003e\n\u003cspan id=\"cb3-18\"\u003e\u003ca href=\"#cb3-18\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        \u003cspan class=\"co\"\u003e# regularization\u003c/span\u003e\u003c/span\u003e\n\u003cspan id=\"cb3-19\"\u003e\u003ca href=\"#cb3-19\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        \u003cspan class=\"va\"\u003eself\u003c/span\u003e.attn_dropout \u003cspan class=\"op\"\u003e=\u003c/span\u003e nn.Dropout(config.dropout)\u003c/span\u003e\n\u003cspan id=\"cb3-20\"\u003e\u003ca href=\"#cb3-20\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        \u003cspan class=\"va\"\u003eself\u003c/span\u003e.resid_dropout \u003cspan class=\"op\"\u003e=\u003c/span\u003e nn.Dropout(config.dropout)\u003c/span\u003e\n\u003cspan id=\"cb3-21\"\u003e\u003ca href=\"#cb3-21\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        \u003cspan class=\"va\"\u003eself\u003c/span\u003e.n_head \u003cspan class=\"op\"\u003e=\u003c/span\u003e config.n_head\u003c/span\u003e\n\u003cspan id=\"cb3-22\"\u003e\u003ca href=\"#cb3-22\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        \u003cspan class=\"va\"\u003eself\u003c/span\u003e.n_embd \u003cspan class=\"op\"\u003e=\u003c/span\u003e config.n_embd\u003c/span\u003e\n\u003cspan id=\"cb3-23\"\u003e\u003ca href=\"#cb3-23\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        \u003cspan class=\"va\"\u003eself\u003c/span\u003e.dropout \u003cspan class=\"op\"\u003e=\u003c/span\u003e config.dropout\u003c/span\u003e\n\u003cspan id=\"cb3-24\"\u003e\u003ca href=\"#cb3-24\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        \u003cspan class=\"co\"\u003e# flash attention make GPU go brrrrr but support is only in\u003c/span\u003e\u003c/span\u003e\n\u003cspan id=\"cb3-25\"\u003e\u003ca href=\"#cb3-25\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        \u003cspan 
class=\"co\"\u003e# PyTorch \u0026gt;= 2.0\u003c/span\u003e\u003c/span\u003e\n\u003cspan id=\"cb3-26\"\u003e\u003ca href=\"#cb3-26\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        \u003cspan class=\"va\"\u003eself\u003c/span\u003e.flash \u003cspan class=\"op\"\u003e=\u003c/span\u003e \u003cspan class=\"bu\"\u003ehasattr\u003c/span\u003e(\u003c/span\u003e\n\u003cspan id=\"cb3-27\"\u003e\u003ca href=\"#cb3-27\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e            torch.nn.functional,\u003c/span\u003e\n\u003cspan id=\"cb3-28\"\u003e\u003ca href=\"#cb3-28\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e            \u003cspan class=\"st\"\u003e'scaled_dot_product_attention'\u003c/span\u003e\u003c/span\u003e\n\u003cspan id=\"cb3-29\"\u003e\u003ca href=\"#cb3-29\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        )\u003c/span\u003e\n\u003cspan id=\"cb3-30\"\u003e\u003ca href=\"#cb3-30\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        \u003cspan class=\"co\"\u003e# if self.flash and RANK == 0:\u003c/span\u003e\u003c/span\u003e\n\u003cspan id=\"cb3-31\"\u003e\u003ca href=\"#cb3-31\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        \u003cspan class=\"co\"\u003e#     log.warning(\u003c/span\u003e\u003c/span\u003e\n\u003cspan id=\"cb3-32\"\u003e\u003ca href=\"#cb3-32\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        \u003cspan class=\"co\"\u003e#         f'Using torch.nn.functional.scaled_dot_product_attention'\u003c/span\u003e\u003c/span\u003e\n\u003cspan id=\"cb3-33\"\u003e\u003ca href=\"#cb3-33\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        \u003cspan class=\"co\"\u003e#         '(Flash Attn)'\u003c/span\u003e\u003c/span\u003e\n\u003cspan id=\"cb3-34\"\u003e\u003ca href=\"#cb3-34\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        \u003cspan class=\"co\"\u003e#     )\u003c/span\u003e\u003c/span\u003e\n\u003cspan id=\"cb3-35\"\u003e\u003ca 
href=\"#cb3-35\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        \u003cspan class=\"cf\"\u003eif\u003c/span\u003e \u003cspan class=\"kw\"\u003enot\u003c/span\u003e \u003cspan class=\"va\"\u003eself\u003c/span\u003e.flash:\u003c/span\u003e\n\u003cspan id=\"cb3-36\"\u003e\u003ca href=\"#cb3-36\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e            log.warning(\u003c/span\u003e\n\u003cspan id=\"cb3-37\"\u003e\u003ca href=\"#cb3-37\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e                \u003cspan class=\"st\"\u003e\"WARNING: using slow attention. \"\u003c/span\u003e\u003c/span\u003e\n\u003cspan id=\"cb3-38\"\u003e\u003ca href=\"#cb3-38\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e                \u003cspan class=\"st\"\u003e\"Flash Attention requires PyTorch \u0026gt;= 2.0\"\u003c/span\u003e\u003c/span\u003e\n\u003cspan id=\"cb3-39\"\u003e\u003ca href=\"#cb3-39\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e            )\u003c/span\u003e\n\u003cspan id=\"cb3-40\"\u003e\u003ca href=\"#cb3-40\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e            \u003cspan class=\"co\"\u003e# causal mask to ensure that attention is only applied to the left\u003c/span\u003e\u003c/span\u003e\n\u003cspan id=\"cb3-41\"\u003e\u003ca href=\"#cb3-41\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e            \u003cspan class=\"co\"\u003e# in the input sequence\u003c/span\u003e\u003c/span\u003e\n\u003cspan id=\"cb3-42\"\u003e\u003ca href=\"#cb3-42\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e            \u003cspan class=\"va\"\u003eself\u003c/span\u003e.register_buffer(\u003c/span\u003e\n\u003cspan id=\"cb3-43\"\u003e\u003ca href=\"#cb3-43\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e                \u003cspan class=\"st\"\u003e\"bias\"\u003c/span\u003e,\u003c/span\u003e\n\u003cspan id=\"cb3-44\"\u003e\u003ca href=\"#cb3-44\" aria-hidden=\"true\" 
tabindex=\"-1\"\u003e\u003c/a\u003e                torch.tril(\u003c/span\u003e\n\u003cspan id=\"cb3-45\"\u003e\u003ca href=\"#cb3-45\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e                    torch.ones(\u003c/span\u003e\n\u003cspan id=\"cb3-46\"\u003e\u003ca href=\"#cb3-46\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e                        config.block_size,\u003c/span\u003e\n\u003cspan id=\"cb3-47\"\u003e\u003ca href=\"#cb3-47\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e                        config.block_size\u003c/span\u003e\n\u003cspan id=\"cb3-48\"\u003e\u003ca href=\"#cb3-48\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e                    )\u003c/span\u003e\n\u003cspan id=\"cb3-49\"\u003e\u003ca href=\"#cb3-49\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e                ).view(\u003cspan class=\"dv\"\u003e1\u003c/span\u003e, \u003cspan class=\"dv\"\u003e1\u003c/span\u003e, config.block_size, config.block_size)\u003c/span\u003e\n\u003cspan id=\"cb3-50\"\u003e\u003ca href=\"#cb3-50\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e            )\u003c/span\u003e\n\u003cspan id=\"cb3-51\"\u003e\u003ca href=\"#cb3-51\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e\u003c/span\u003e\n\u003cspan id=\"cb3-52\"\u003e\u003ca href=\"#cb3-52\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e    \u003cspan class=\"kw\"\u003edef\u003c/span\u003e forward(\u003cspan class=\"va\"\u003eself\u003c/span\u003e, x):\u003c/span\u003e\n\u003cspan id=\"cb3-53\"\u003e\u003ca href=\"#cb3-53\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        \u003cspan class=\"co\"\u003e# batch size, sequence length, embedding dimensionality (n_embd)\u003c/span\u003e\u003c/span\u003e\n\u003cspan id=\"cb3-54\"\u003e\u003ca href=\"#cb3-54\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        B, T, C \u003cspan class=\"op\"\u003e=\u003c/span\u003e 
x.size()\u003c/span\u003e\n\u003cspan id=\"cb3-55\"\u003e\u003ca href=\"#cb3-55\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e\u003c/span\u003e\n\u003cspan id=\"cb3-56\"\u003e\u003ca href=\"#cb3-56\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        \u003cspan class=\"co\"\u003e# calculate query, key, values for all heads in batch and move head\u003c/span\u003e\u003c/span\u003e\n\u003cspan id=\"cb3-57\"\u003e\u003ca href=\"#cb3-57\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        \u003cspan class=\"co\"\u003e# forward to be the batch dim\u003c/span\u003e\u003c/span\u003e\n\u003cspan id=\"cb3-58\"\u003e\u003ca href=\"#cb3-58\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        q, k, v \u003cspan class=\"op\"\u003e=\u003c/span\u003e \u003cspan class=\"va\"\u003eself\u003c/span\u003e.c_attn(x).split(\u003cspan class=\"va\"\u003eself\u003c/span\u003e.n_embd, dim\u003cspan class=\"op\"\u003e=\u003c/span\u003e\u003cspan class=\"dv\"\u003e2\u003c/span\u003e)\u003c/span\u003e\n\u003cspan id=\"cb3-59\"\u003e\u003ca href=\"#cb3-59\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        \u003cspan class=\"co\"\u003e# (B, nh, T, hs)\u003c/span\u003e\u003c/span\u003e\n\u003cspan id=\"cb3-60\"\u003e\u003ca href=\"#cb3-60\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        k \u003cspan class=\"op\"\u003e=\u003c/span\u003e k.view(B, T, \u003cspan class=\"va\"\u003eself\u003c/span\u003e.n_head, C \u003cspan class=\"op\"\u003e//\u003c/span\u003e \u003cspan class=\"va\"\u003eself\u003c/span\u003e.n_head).transpose(\u003cspan class=\"dv\"\u003e1\u003c/span\u003e, \u003cspan class=\"dv\"\u003e2\u003c/span\u003e)\u003c/span\u003e\n\u003cspan id=\"cb3-61\"\u003e\u003ca href=\"#cb3-61\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        \u003cspan class=\"co\"\u003e# (B, nh, T, hs)\u003c/span\u003e\u003c/span\u003e\n\u003cspan id=\"cb3-62\"\u003e\u003ca href=\"#cb3-62\" aria-hidden=\"true\" 
tabindex=\"-1\"\u003e\u003c/a\u003e        q \u003cspan class=\"op\"\u003e=\u003c/span\u003e q.view(B, T, \u003cspan class=\"va\"\u003eself\u003c/span\u003e.n_head, C \u003cspan class=\"op\"\u003e//\u003c/span\u003e \u003cspan class=\"va\"\u003eself\u003c/span\u003e.n_head).transpose(\u003cspan class=\"dv\"\u003e1\u003c/span\u003e, \u003cspan class=\"dv\"\u003e2\u003c/span\u003e)\u003c/span\u003e\n\u003cspan id=\"cb3-63\"\u003e\u003ca href=\"#cb3-63\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        \u003cspan class=\"co\"\u003e# (B, nh, T, hs)\u003c/span\u003e\u003c/span\u003e\n\u003cspan id=\"cb3-64\"\u003e\u003ca href=\"#cb3-64\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        v \u003cspan class=\"op\"\u003e=\u003c/span\u003e v.view(B, T, \u003cspan class=\"va\"\u003eself\u003c/span\u003e.n_head, C \u003cspan class=\"op\"\u003e//\u003c/span\u003e \u003cspan class=\"va\"\u003eself\u003c/span\u003e.n_head).transpose(\u003cspan class=\"dv\"\u003e1\u003c/span\u003e, \u003cspan class=\"dv\"\u003e2\u003c/span\u003e)\u003c/span\u003e\n\u003cspan id=\"cb3-65\"\u003e\u003ca href=\"#cb3-65\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        \u003cspan class=\"co\"\u003e# causal self-attention; Self-attend:\u003c/span\u003e\u003c/span\u003e\n\u003cspan id=\"cb3-66\"\u003e\u003ca href=\"#cb3-66\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        \u003cspan class=\"co\"\u003e# (B, nh, T, hs) x (B, nh, hs, T) -\u0026gt; (B, nh, T, T)\u003c/span\u003e\u003c/span\u003e\n\u003cspan id=\"cb3-67\"\u003e\u003ca href=\"#cb3-67\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        \u003cspan class=\"cf\"\u003eif\u003c/span\u003e \u003cspan class=\"va\"\u003eself\u003c/span\u003e.flash:\u003c/span\u003e\n\u003cspan id=\"cb3-68\"\u003e\u003ca href=\"#cb3-68\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e            \u003cspan class=\"co\"\u003e# efficient attention using Flash Attention CUDA 
kernels\u003c/span\u003e\u003c/span\u003e\n\u003cspan id=\"cb3-69\"\u003e\u003ca href=\"#cb3-69\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e            y \u003cspan class=\"op\"\u003e=\u003c/span\u003e torch.nn.functional.scaled_dot_product_attention(\u003c/span\u003e\n\u003cspan id=\"cb3-70\"\u003e\u003ca href=\"#cb3-70\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e                q,\u003c/span\u003e\n\u003cspan id=\"cb3-71\"\u003e\u003ca href=\"#cb3-71\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e                k,\u003c/span\u003e\n\u003cspan id=\"cb3-72\"\u003e\u003ca href=\"#cb3-72\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e                v,\u003c/span\u003e\n\u003cspan id=\"cb3-73\"\u003e\u003ca href=\"#cb3-73\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e                attn_mask\u003cspan class=\"op\"\u003e=\u003c/span\u003e\u003cspan class=\"va\"\u003eNone\u003c/span\u003e,\u003c/span\u003e\n\u003cspan id=\"cb3-74\"\u003e\u003ca href=\"#cb3-74\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e                dropout_p\u003cspan class=\"op\"\u003e=\u003c/span\u003e(\u003cspan class=\"va\"\u003eself\u003c/span\u003e.dropout \u003cspan class=\"cf\"\u003eif\u003c/span\u003e \u003cspan class=\"va\"\u003eself\u003c/span\u003e.training \u003cspan class=\"cf\"\u003eelse\u003c/span\u003e \u003cspan class=\"dv\"\u003e0\u003c/span\u003e),\u003c/span\u003e\n\u003cspan id=\"cb3-75\"\u003e\u003ca href=\"#cb3-75\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e                is_causal\u003cspan class=\"op\"\u003e=\u003c/span\u003e\u003cspan class=\"va\"\u003eTrue\u003c/span\u003e\u003c/span\u003e\n\u003cspan id=\"cb3-76\"\u003e\u003ca href=\"#cb3-76\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e            )\u003c/span\u003e\n\u003cspan id=\"cb3-77\"\u003e\u003ca href=\"#cb3-77\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        \u003cspan 
class=\"cf\"\u003eelse\u003c/span\u003e:\u003c/span\u003e\n\u003cspan id=\"cb3-78\"\u003e\u003ca href=\"#cb3-78\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e            \u003cspan class=\"co\"\u003e# manual implementation of attention\u003c/span\u003e\u003c/span\u003e\n\u003cspan id=\"cb3-79\"\u003e\u003ca href=\"#cb3-79\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e            att \u003cspan class=\"op\"\u003e=\u003c/span\u003e (q \u003cspan class=\"op\"\u003e@\u003c/span\u003e k.transpose(\u003cspan class=\"op\"\u003e-\u003c/span\u003e\u003cspan class=\"dv\"\u003e2\u003c/span\u003e, \u003cspan class=\"op\"\u003e-\u003c/span\u003e\u003cspan class=\"dv\"\u003e1\u003c/span\u003e)) \u003cspan class=\"op\"\u003e*\u003c/span\u003e (\u003cspan class=\"fl\"\u003e1.0\u003c/span\u003e \u003cspan class=\"op\"\u003e/\u003c/span\u003e math.sqrt(k.size(\u003cspan class=\"op\"\u003e-\u003c/span\u003e\u003cspan class=\"dv\"\u003e1\u003c/span\u003e)))\u003c/span\u003e\n\u003cspan id=\"cb3-80\"\u003e\u003ca href=\"#cb3-80\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e            att \u003cspan class=\"op\"\u003e=\u003c/span\u003e att.masked_fill(\u003c/span\u003e\n\u003cspan id=\"cb3-81\"\u003e\u003ca href=\"#cb3-81\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e                \u003cspan class=\"va\"\u003eself\u003c/span\u003e.bias[:, :, :T, :T] \u003cspan class=\"op\"\u003e==\u003c/span\u003e \u003cspan class=\"dv\"\u003e0\u003c/span\u003e,  \u003cspan class=\"co\"\u003e# type:ignore\u003c/span\u003e\u003c/span\u003e\n\u003cspan id=\"cb3-82\"\u003e\u003ca href=\"#cb3-82\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e                \u003cspan class=\"bu\"\u003efloat\u003c/span\u003e(\u003cspan class=\"st\"\u003e'-inf'\u003c/span\u003e)\u003c/span\u003e\n\u003cspan id=\"cb3-83\"\u003e\u003ca href=\"#cb3-83\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e            )\u003c/span\u003e\n\u003cspan 
id=\"cb3-84\"\u003e\u003ca href=\"#cb3-84\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e            att \u003cspan class=\"op\"\u003e=\u003c/span\u003e F.softmax(att, dim\u003cspan class=\"op\"\u003e=-\u003c/span\u003e\u003cspan class=\"dv\"\u003e1\u003c/span\u003e)\u003c/span\u003e\n\u003cspan id=\"cb3-85\"\u003e\u003ca href=\"#cb3-85\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e            att \u003cspan class=\"op\"\u003e=\u003c/span\u003e \u003cspan class=\"va\"\u003eself\u003c/span\u003e.attn_dropout(att)\u003c/span\u003e\n\u003cspan id=\"cb3-86\"\u003e\u003ca href=\"#cb3-86\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e            y \u003cspan class=\"op\"\u003e=\u003c/span\u003e att \u003cspan class=\"op\"\u003e@\u003c/span\u003e v  \u003cspan class=\"co\"\u003e# (B, nh, T, T) x (B, nh, T, hs) -\u0026gt; (B, nh, T, hs)\u003c/span\u003e\u003c/span\u003e\n\u003cspan id=\"cb3-87\"\u003e\u003ca href=\"#cb3-87\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        \u003cspan class=\"co\"\u003e# re-assemble all head outputs side by side\u003c/span\u003e\u003c/span\u003e\n\u003cspan id=\"cb3-88\"\u003e\u003ca href=\"#cb3-88\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        y \u003cspan class=\"op\"\u003e=\u003c/span\u003e y.transpose(\u003cspan class=\"dv\"\u003e1\u003c/span\u003e, \u003cspan class=\"dv\"\u003e2\u003c/span\u003e).contiguous().view(B, T, C)\u003c/span\u003e\n\u003cspan id=\"cb3-89\"\u003e\u003ca href=\"#cb3-89\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e\u003c/span\u003e\n\u003cspan id=\"cb3-90\"\u003e\u003ca href=\"#cb3-90\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        \u003cspan class=\"co\"\u003e# output projection\u003c/span\u003e\u003c/span\u003e\n\u003cspan id=\"cb3-91\"\u003e\u003ca href=\"#cb3-91\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        y \u003cspan class=\"op\"\u003e=\u003c/span\u003e \u003cspan 
class=\"va\"\u003eself\u003c/span\u003e.resid_dropout(\u003cspan class=\"va\"\u003eself\u003c/span\u003e.c_proj(y))\u003c/span\u003e\n\u003cspan id=\"cb3-92\"\u003e\u003ca href=\"#cb3-92\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        \u003cspan class=\"cf\"\u003ereturn\u003c/span\u003e y\u003c/span\u003e\u003c/code\u003e\u003c/pre\u003e\u003c/div\u003e\n\u003c/div\u003e\n\u003cdiv id=\"tabset-1-2\" class=\"tab-pane\" role=\"tabpanel\" aria-labelledby=\"tabset-1-2-tab\"\u003e\n\u003cdiv class=\"sourceCode\" id=\"cb4\"\u003e\u003cpre class=\"sourceCode python\"\u003e\u003ccode class=\"sourceCode python\"\u003e\u003cspan id=\"cb4-1\"\u003e\u003ca href=\"#cb4-1\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e\u003cspan class=\"kw\"\u003eclass\u003c/span\u003e LayerNorm(nn.Module):\u003c/span\u003e\n\u003cspan id=\"cb4-2\"\u003e\u003ca href=\"#cb4-2\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e    \u003cspan class=\"co\"\u003e\"\"\"\u003c/span\u003e\u003c/span\u003e\n\u003cspan id=\"cb4-3\"\u003e\u003ca href=\"#cb4-3\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e\u003cspan class=\"co\"\u003e    LayerNorm but with an optional bias.\u003c/span\u003e\u003c/span\u003e\n\u003cspan id=\"cb4-4\"\u003e\u003ca href=\"#cb4-4\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e\u003c/span\u003e\n\u003cspan id=\"cb4-5\"\u003e\u003ca href=\"#cb4-5\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e\u003cspan class=\"co\"\u003e    (PyTorch doesn't support simply bias=False)\u003c/span\u003e\u003c/span\u003e\n\u003cspan id=\"cb4-6\"\u003e\u003ca href=\"#cb4-6\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e\u003cspan class=\"co\"\u003e    \"\"\"\u003c/span\u003e\u003c/span\u003e\n\u003cspan id=\"cb4-7\"\u003e\u003ca href=\"#cb4-7\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e\u003c/span\u003e\n\u003cspan id=\"cb4-8\"\u003e\u003ca href=\"#cb4-8\" aria-hidden=\"true\" 
tabindex=\"-1\"\u003e\u003c/a\u003e    \u003cspan class=\"kw\"\u003edef\u003c/span\u003e \u003cspan class=\"fu\"\u003e__init__\u003c/span\u003e(\u003cspan class=\"va\"\u003eself\u003c/span\u003e, ndim, bias):\u003c/span\u003e\n\u003cspan id=\"cb4-9\"\u003e\u003ca href=\"#cb4-9\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        \u003cspan class=\"bu\"\u003esuper\u003c/span\u003e().\u003cspan class=\"fu\"\u003e__init__\u003c/span\u003e()\u003c/span\u003e\n\u003cspan id=\"cb4-10\"\u003e\u003ca href=\"#cb4-10\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        \u003cspan class=\"va\"\u003eself\u003c/span\u003e.weight \u003cspan class=\"op\"\u003e=\u003c/span\u003e nn.Parameter(torch.ones(ndim))\u003c/span\u003e\n\u003cspan id=\"cb4-11\"\u003e\u003ca href=\"#cb4-11\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        \u003cspan class=\"va\"\u003eself\u003c/span\u003e.bias \u003cspan class=\"op\"\u003e=\u003c/span\u003e nn.Parameter(torch.zeros(ndim)) \u003cspan class=\"cf\"\u003eif\u003c/span\u003e bias \u003cspan class=\"cf\"\u003eelse\u003c/span\u003e \u003cspan class=\"va\"\u003eNone\u003c/span\u003e\u003c/span\u003e\n\u003cspan id=\"cb4-12\"\u003e\u003ca href=\"#cb4-12\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e\u003c/span\u003e\n\u003cspan id=\"cb4-13\"\u003e\u003ca href=\"#cb4-13\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e    \u003cspan class=\"kw\"\u003edef\u003c/span\u003e forward(\u003cspan class=\"va\"\u003eself\u003c/span\u003e, \u003cspan class=\"bu\"\u003einput\u003c/span\u003e):\u003c/span\u003e\n\u003cspan id=\"cb4-14\"\u003e\u003ca href=\"#cb4-14\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        \u003cspan class=\"cf\"\u003ereturn\u003c/span\u003e F.layer_norm(\u003c/span\u003e\n\u003cspan id=\"cb4-15\"\u003e\u003ca href=\"#cb4-15\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e            \u003cspan 
class=\"bu\"\u003einput\u003c/span\u003e,\u003c/span\u003e\n\u003cspan id=\"cb4-16\"\u003e\u003ca href=\"#cb4-16\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e            \u003cspan class=\"va\"\u003eself\u003c/span\u003e.weight.shape,\u003c/span\u003e\n\u003cspan id=\"cb4-17\"\u003e\u003ca href=\"#cb4-17\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e            \u003cspan class=\"va\"\u003eself\u003c/span\u003e.weight,\u003c/span\u003e\n\u003cspan id=\"cb4-18\"\u003e\u003ca href=\"#cb4-18\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e            \u003cspan class=\"va\"\u003eself\u003c/span\u003e.bias,\u003c/span\u003e\n\u003cspan id=\"cb4-19\"\u003e\u003ca href=\"#cb4-19\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e            \u003cspan class=\"fl\"\u003e1e-5\u003c/span\u003e\u003c/span\u003e\n\u003cspan id=\"cb4-20\"\u003e\u003ca href=\"#cb4-20\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        )\u003c/span\u003e\u003c/code\u003e\u003c/pre\u003e\u003c/div\u003e\n\u003c/div\u003e\n\u003cdiv id=\"tabset-1-3\" class=\"tab-pane\" role=\"tabpanel\" aria-labelledby=\"tabset-1-3-tab\"\u003e\n\u003cdiv class=\"sourceCode\" id=\"cb5\"\u003e\u003cpre class=\"sourceCode python\"\u003e\u003ccode class=\"sourceCode python\"\u003e\u003cspan id=\"cb5-1\"\u003e\u003ca href=\"#cb5-1\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e\u003cspan class=\"kw\"\u003eclass\u003c/span\u003e MLP(nn.Module):\u003c/span\u003e\n\u003cspan id=\"cb5-2\"\u003e\u003ca href=\"#cb5-2\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e\u003c/span\u003e\n\u003cspan id=\"cb5-3\"\u003e\u003ca href=\"#cb5-3\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e    \u003cspan class=\"kw\"\u003edef\u003c/span\u003e \u003cspan class=\"fu\"\u003e__init__\u003c/span\u003e(\u003c/span\u003e\n\u003cspan id=\"cb5-4\"\u003e\u003ca href=\"#cb5-4\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e            \u003cspan 
class=\"va\"\u003eself\u003c/span\u003e,\u003c/span\u003e\n\u003cspan id=\"cb5-5\"\u003e\u003ca href=\"#cb5-5\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e            config: GPTModelConfig,\u003c/span\u003e\n\u003cspan id=\"cb5-6\"\u003e\u003ca href=\"#cb5-6\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e            activation: \u003cspan class=\"bu\"\u003estr\u003c/span\u003e \u003cspan class=\"op\"\u003e=\u003c/span\u003e \u003cspan class=\"st\"\u003e'gelu'\u003c/span\u003e,\u003c/span\u003e\n\u003cspan id=\"cb5-7\"\u003e\u003ca href=\"#cb5-7\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e    ):\u003c/span\u003e\n\u003cspan id=\"cb5-8\"\u003e\u003ca href=\"#cb5-8\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        \u003cspan class=\"bu\"\u003esuper\u003c/span\u003e().\u003cspan class=\"fu\"\u003e__init__\u003c/span\u003e()\u003c/span\u003e\n\u003cspan id=\"cb5-9\"\u003e\u003ca href=\"#cb5-9\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        \u003cspan class=\"va\"\u003eself\u003c/span\u003e.c_fc \u003cspan class=\"op\"\u003e=\u003c/span\u003e nn.Linear(\u003c/span\u003e\n\u003cspan id=\"cb5-10\"\u003e\u003ca href=\"#cb5-10\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e            config.n_embd,\u003c/span\u003e\n\u003cspan id=\"cb5-11\"\u003e\u003ca href=\"#cb5-11\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e            \u003cspan class=\"dv\"\u003e4\u003c/span\u003e \u003cspan class=\"op\"\u003e*\u003c/span\u003e config.n_embd,\u003c/span\u003e\n\u003cspan id=\"cb5-12\"\u003e\u003ca href=\"#cb5-12\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e            bias\u003cspan class=\"op\"\u003e=\u003c/span\u003econfig.bias\u003c/span\u003e\n\u003cspan id=\"cb5-13\"\u003e\u003ca href=\"#cb5-13\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        )\u003c/span\u003e\n\u003cspan id=\"cb5-14\"\u003e\u003ca href=\"#cb5-14\" aria-hidden=\"true\" 
tabindex=\"-1\"\u003e\u003c/a\u003e        \u003cspan class=\"cf\"\u003eif\u003c/span\u003e activation.lower() \u003cspan class=\"kw\"\u003ein\u003c/span\u003e ACTIVATIONS:\u003c/span\u003e\n\u003cspan id=\"cb5-15\"\u003e\u003ca href=\"#cb5-15\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e            \u003cspan class=\"va\"\u003eself\u003c/span\u003e.act_fn \u003cspan class=\"op\"\u003e=\u003c/span\u003e ACTIVATIONS[activation.lower()]\u003c/span\u003e\n\u003cspan id=\"cb5-16\"\u003e\u003ca href=\"#cb5-16\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        \u003cspan class=\"cf\"\u003eelse\u003c/span\u003e:\u003c/span\u003e\n\u003cspan id=\"cb5-17\"\u003e\u003ca href=\"#cb5-17\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e            \u003cspan class=\"cf\"\u003etry\u003c/span\u003e:\u003c/span\u003e\n\u003cspan id=\"cb5-18\"\u003e\u003ca href=\"#cb5-18\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e                act_fn \u003cspan class=\"op\"\u003e=\u003c/span\u003e \u003cspan class=\"bu\"\u003egetattr\u003c/span\u003e(nn, activation)\u003c/span\u003e\n\u003cspan id=\"cb5-19\"\u003e\u003ca href=\"#cb5-19\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e                \u003cspan class=\"cf\"\u003eassert\u003c/span\u003e \u003cspan class=\"bu\"\u003ecallable\u003c/span\u003e(act_fn)\u003c/span\u003e\n\u003cspan id=\"cb5-20\"\u003e\u003ca href=\"#cb5-20\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e                \u003cspan class=\"va\"\u003eself\u003c/span\u003e.act_fn \u003cspan class=\"op\"\u003e=\u003c/span\u003e act_fn()\u003c/span\u003e\n\u003cspan id=\"cb5-21\"\u003e\u003ca href=\"#cb5-21\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e            \u003cspan class=\"cf\"\u003eexcept\u003c/span\u003e \u003cspan class=\"pp\"\u003eException\u003c/span\u003e \u003cspan class=\"im\"\u003eas\u003c/span\u003e exc:\u003c/span\u003e\n\u003cspan id=\"cb5-22\"\u003e\u003ca 
href=\"#cb5-22\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e                log.error(\u003cspan class=\"ss\"\u003ef'\u003c/span\u003e\u003cspan class=\"sc\"\u003e{\u003c/span\u003eactivation\u003cspan class=\"sc\"\u003e}\u003c/span\u003e\u003cspan class=\"ss\"\u003e not yet supported!'\u003c/span\u003e)\u003c/span\u003e\n\u003cspan id=\"cb5-23\"\u003e\u003ca href=\"#cb5-23\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e                \u003cspan class=\"cf\"\u003eraise\u003c/span\u003e exc\u003c/span\u003e\n\u003cspan id=\"cb5-24\"\u003e\u003ca href=\"#cb5-24\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        \u003cspan class=\"co\"\u003e# self.gelu = nn.GELU()\u003c/span\u003e\u003c/span\u003e\n\u003cspan id=\"cb5-25\"\u003e\u003ca href=\"#cb5-25\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        \u003cspan class=\"va\"\u003eself\u003c/span\u003e.c_proj \u003cspan class=\"op\"\u003e=\u003c/span\u003e nn.Linear(\u003c/span\u003e\n\u003cspan id=\"cb5-26\"\u003e\u003ca href=\"#cb5-26\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e            \u003cspan class=\"dv\"\u003e4\u003c/span\u003e \u003cspan class=\"op\"\u003e*\u003c/span\u003e config.n_embd,\u003c/span\u003e\n\u003cspan id=\"cb5-27\"\u003e\u003ca href=\"#cb5-27\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e            config.n_embd,\u003c/span\u003e\n\u003cspan id=\"cb5-28\"\u003e\u003ca href=\"#cb5-28\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e            bias\u003cspan class=\"op\"\u003e=\u003c/span\u003econfig.bias\u003c/span\u003e\n\u003cspan id=\"cb5-29\"\u003e\u003ca href=\"#cb5-29\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        )\u003c/span\u003e\n\u003cspan id=\"cb5-30\"\u003e\u003ca href=\"#cb5-30\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        \u003cspan class=\"va\"\u003eself\u003c/span\u003e.dropout \u003cspan class=\"op\"\u003e=\u003c/span\u003e 
nn.Dropout(config.dropout)\u003c/span\u003e\n\u003cspan id=\"cb5-31\"\u003e\u003ca href=\"#cb5-31\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e\u003c/span\u003e\n\u003cspan id=\"cb5-32\"\u003e\u003ca href=\"#cb5-32\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e    \u003cspan class=\"kw\"\u003edef\u003c/span\u003e forward(\u003cspan class=\"va\"\u003eself\u003c/span\u003e, x):\u003c/span\u003e\n\u003cspan id=\"cb5-33\"\u003e\u003ca href=\"#cb5-33\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        x \u003cspan class=\"op\"\u003e=\u003c/span\u003e \u003cspan class=\"va\"\u003eself\u003c/span\u003e.c_fc(x)\u003c/span\u003e\n\u003cspan id=\"cb5-34\"\u003e\u003ca href=\"#cb5-34\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        \u003cspan class=\"co\"\u003e# x = self.gelu(x)\u003c/span\u003e\u003c/span\u003e\n\u003cspan id=\"cb5-35\"\u003e\u003ca href=\"#cb5-35\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        x \u003cspan class=\"op\"\u003e=\u003c/span\u003e \u003cspan class=\"va\"\u003eself\u003c/span\u003e.act_fn(x)\u003c/span\u003e\n\u003cspan id=\"cb5-36\"\u003e\u003ca href=\"#cb5-36\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        x \u003cspan class=\"op\"\u003e=\u003c/span\u003e \u003cspan class=\"va\"\u003eself\u003c/span\u003e.c_proj(x)\u003c/span\u003e\n\u003cspan id=\"cb5-37\"\u003e\u003ca href=\"#cb5-37\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        x \u003cspan class=\"op\"\u003e=\u003c/span\u003e \u003cspan class=\"va\"\u003eself\u003c/span\u003e.dropout(x)\u003c/span\u003e\n\u003cspan id=\"cb5-38\"\u003e\u003ca href=\"#cb5-38\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        \u003cspan class=\"cf\"\u003ereturn\u003c/span\u003e x\u003c/span\u003e\u003c/code\u003e\u003c/pre\u003e\u003c/div\u003e\n\u003c/div\u003e\n\u003cdiv id=\"tabset-1-4\" class=\"tab-pane\" role=\"tabpanel\" aria-labelledby=\"tabset-1-4-tab\"\u003e\n\u003cdiv 
class=\"sourceCode\" id=\"cb6\"\u003e\u003cpre class=\"sourceCode python\"\u003e\u003ccode class=\"sourceCode python\"\u003e\u003cspan id=\"cb6-1\"\u003e\u003ca href=\"#cb6-1\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e\u003cspan class=\"kw\"\u003eclass\u003c/span\u003e Block(nn.Module):\u003c/span\u003e\n\u003cspan id=\"cb6-2\"\u003e\u003ca href=\"#cb6-2\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e\u003c/span\u003e\n\u003cspan id=\"cb6-3\"\u003e\u003ca href=\"#cb6-3\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e    \u003cspan class=\"kw\"\u003edef\u003c/span\u003e \u003cspan class=\"fu\"\u003e__init__\u003c/span\u003e(\u003cspan class=\"va\"\u003eself\u003c/span\u003e, config: GPTModelConfig):\u003c/span\u003e\n\u003cspan id=\"cb6-4\"\u003e\u003ca href=\"#cb6-4\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        \u003cspan class=\"bu\"\u003esuper\u003c/span\u003e().\u003cspan class=\"fu\"\u003e__init__\u003c/span\u003e()\u003c/span\u003e\n\u003cspan id=\"cb6-5\"\u003e\u003ca href=\"#cb6-5\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        \u003cspan class=\"va\"\u003eself\u003c/span\u003e.ln_1 \u003cspan class=\"op\"\u003e=\u003c/span\u003e LayerNorm(config.n_embd, bias\u003cspan class=\"op\"\u003e=\u003c/span\u003econfig.bias)\u003c/span\u003e\n\u003cspan id=\"cb6-6\"\u003e\u003ca href=\"#cb6-6\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        \u003cspan class=\"va\"\u003eself\u003c/span\u003e.attn \u003cspan class=\"op\"\u003e=\u003c/span\u003e CausalSelfAttention(config)\u003c/span\u003e\n\u003cspan id=\"cb6-7\"\u003e\u003ca href=\"#cb6-7\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        \u003cspan class=\"va\"\u003eself\u003c/span\u003e.ln_2 \u003cspan class=\"op\"\u003e=\u003c/span\u003e LayerNorm(config.n_embd, bias\u003cspan class=\"op\"\u003e=\u003c/span\u003econfig.bias)\u003c/span\u003e\n\u003cspan id=\"cb6-8\"\u003e\u003ca href=\"#cb6-8\" 
aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        \u003cspan class=\"va\"\u003eself\u003c/span\u003e.mlp \u003cspan class=\"op\"\u003e=\u003c/span\u003e MLP(config)\u003c/span\u003e\n\u003cspan id=\"cb6-9\"\u003e\u003ca href=\"#cb6-9\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e\u003c/span\u003e\n\u003cspan id=\"cb6-10\"\u003e\u003ca href=\"#cb6-10\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e    \u003cspan class=\"kw\"\u003edef\u003c/span\u003e forward(\u003cspan class=\"va\"\u003eself\u003c/span\u003e, x):\u003c/span\u003e\n\u003cspan id=\"cb6-11\"\u003e\u003ca href=\"#cb6-11\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        x \u003cspan class=\"op\"\u003e=\u003c/span\u003e x \u003cspan class=\"op\"\u003e+\u003c/span\u003e \u003cspan class=\"va\"\u003eself\u003c/span\u003e.attn(\u003cspan class=\"va\"\u003eself\u003c/span\u003e.ln_1(x))\u003c/span\u003e\n\u003cspan id=\"cb6-12\"\u003e\u003ca href=\"#cb6-12\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        x \u003cspan class=\"op\"\u003e=\u003c/span\u003e x \u003cspan class=\"op\"\u003e+\u003c/span\u003e \u003cspan class=\"va\"\u003eself\u003c/span\u003e.mlp(\u003cspan class=\"va\"\u003eself\u003c/span\u003e.ln_2(x))\u003c/span\u003e\n\u003cspan id=\"cb6-13\"\u003e\u003ca href=\"#cb6-13\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        \u003cspan class=\"cf\"\u003ereturn\u003c/span\u003e x\u003c/span\u003e\u003c/code\u003e\u003c/pre\u003e\u003c/div\u003e\n\u003c/div\u003e\n\u003cdiv id=\"tabset-1-5\" class=\"tab-pane\" role=\"tabpanel\" aria-labelledby=\"tabset-1-5-tab\"\u003e\n\u003cdiv class=\"sourceCode\" id=\"cb7\"\u003e\u003cpre class=\"sourceCode python\"\u003e\u003ccode class=\"sourceCode python\"\u003e\u003cspan id=\"cb7-1\"\u003e\u003ca href=\"#cb7-1\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e\u003cspan class=\"kw\"\u003eclass\u003c/span\u003e GPT(nn.Module):\u003c/span\u003e\n\u003cspan 
id=\"cb7-2\"\u003e\u003ca href=\"#cb7-2\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e    \u003cspan class=\"kw\"\u003edef\u003c/span\u003e \u003cspan class=\"fu\"\u003e__init__\u003c/span\u003e(\u003cspan class=\"va\"\u003eself\u003c/span\u003e, config: GPTModelConfig):\u003c/span\u003e\n\u003cspan id=\"cb7-3\"\u003e\u003ca href=\"#cb7-3\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        \u003cspan class=\"bu\"\u003esuper\u003c/span\u003e().\u003cspan class=\"fu\"\u003e__init__\u003c/span\u003e()\u003c/span\u003e\n\u003cspan id=\"cb7-4\"\u003e\u003ca href=\"#cb7-4\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        \u003cspan class=\"cf\"\u003eassert\u003c/span\u003e config.vocab_size \u003cspan class=\"kw\"\u003eis\u003c/span\u003e \u003cspan class=\"kw\"\u003enot\u003c/span\u003e \u003cspan class=\"va\"\u003eNone\u003c/span\u003e\u003c/span\u003e\n\u003cspan id=\"cb7-5\"\u003e\u003ca href=\"#cb7-5\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        \u003cspan class=\"cf\"\u003eassert\u003c/span\u003e config.block_size \u003cspan class=\"kw\"\u003eis\u003c/span\u003e \u003cspan class=\"kw\"\u003enot\u003c/span\u003e \u003cspan class=\"va\"\u003eNone\u003c/span\u003e\u003c/span\u003e\n\u003cspan id=\"cb7-6\"\u003e\u003ca href=\"#cb7-6\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        \u003cspan class=\"va\"\u003eself\u003c/span\u003e.config \u003cspan class=\"op\"\u003e=\u003c/span\u003e config\u003c/span\u003e\n\u003cspan id=\"cb7-7\"\u003e\u003ca href=\"#cb7-7\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e\u003c/span\u003e\n\u003cspan id=\"cb7-8\"\u003e\u003ca href=\"#cb7-8\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        \u003cspan class=\"va\"\u003eself\u003c/span\u003e.transformer \u003cspan class=\"op\"\u003e=\u003c/span\u003e nn.ModuleDict(\u003cspan class=\"bu\"\u003edict\u003c/span\u003e(\u003c/span\u003e\n\u003cspan id=\"cb7-9\"\u003e\u003ca 
href=\"#cb7-9\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e            wte\u003cspan class=\"op\"\u003e=\u003c/span\u003enn.Embedding(config.vocab_size, config.n_embd),\u003c/span\u003e\n\u003cspan id=\"cb7-10\"\u003e\u003ca href=\"#cb7-10\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e            wpe\u003cspan class=\"op\"\u003e=\u003c/span\u003enn.Embedding(config.block_size, config.n_embd),\u003c/span\u003e\n\u003cspan id=\"cb7-11\"\u003e\u003ca href=\"#cb7-11\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e            drop\u003cspan class=\"op\"\u003e=\u003c/span\u003enn.Dropout(config.dropout),\u003c/span\u003e\n\u003cspan id=\"cb7-12\"\u003e\u003ca href=\"#cb7-12\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e            h\u003cspan class=\"op\"\u003e=\u003c/span\u003enn.ModuleList([Block(config) \u003cspan class=\"cf\"\u003efor\u003c/span\u003e _ \u003cspan class=\"kw\"\u003ein\u003c/span\u003e \u003cspan class=\"bu\"\u003erange\u003c/span\u003e(config.n_layer)]),\u003c/span\u003e\n\u003cspan id=\"cb7-13\"\u003e\u003ca href=\"#cb7-13\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e            ln_f\u003cspan class=\"op\"\u003e=\u003c/span\u003eLayerNorm(config.n_embd, bias\u003cspan class=\"op\"\u003e=\u003c/span\u003econfig.bias),\u003c/span\u003e\n\u003cspan id=\"cb7-14\"\u003e\u003ca href=\"#cb7-14\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        ))\u003c/span\u003e\n\u003cspan id=\"cb7-15\"\u003e\u003ca href=\"#cb7-15\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        \u003cspan class=\"va\"\u003eself\u003c/span\u003e.lm_head \u003cspan class=\"op\"\u003e=\u003c/span\u003e nn.Linear(config.n_embd, config.vocab_size, bias\u003cspan class=\"op\"\u003e=\u003c/span\u003e\u003cspan class=\"va\"\u003eFalse\u003c/span\u003e)\u003c/span\u003e\n\u003cspan id=\"cb7-16\"\u003e\u003ca href=\"#cb7-16\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        \u003cspan 
class=\"co\"\u003e# with weight tying when using torch.compile() some warnings get\u003c/span\u003e\u003c/span\u003e\n\u003cspan id=\"cb7-17\"\u003e\u003ca href=\"#cb7-17\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        \u003cspan class=\"co\"\u003e# generated: \"UserWarning: functional_call was passed multiple values\u003c/span\u003e\u003c/span\u003e\n\u003cspan id=\"cb7-18\"\u003e\u003ca href=\"#cb7-18\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        \u003cspan class=\"co\"\u003e# for tied weights. This behavior is deprecated and will be an error in\u003c/span\u003e\u003c/span\u003e\n\u003cspan id=\"cb7-19\"\u003e\u003ca href=\"#cb7-19\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        \u003cspan class=\"co\"\u003e# future versions\" not 100% sure what this is, so far seems to be\u003c/span\u003e\u003c/span\u003e\n\u003cspan id=\"cb7-20\"\u003e\u003ca href=\"#cb7-20\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        \u003cspan class=\"co\"\u003e# harmless. 
\u003c/span\u003e\u003cspan class=\"al\"\u003eTODO\u003c/span\u003e\u003cspan class=\"co\"\u003e investigate\u003c/span\u003e\u003c/span\u003e\n\u003cspan id=\"cb7-21\"\u003e\u003ca href=\"#cb7-21\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        \u003cspan class=\"co\"\u003e# https://paperswithcode.com/method/weight-tying\u003c/span\u003e\u003c/span\u003e\n\u003cspan id=\"cb7-22\"\u003e\u003ca href=\"#cb7-22\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        \u003cspan class=\"va\"\u003eself\u003c/span\u003e.transformer.wte.weight \u003cspan class=\"op\"\u003e=\u003c/span\u003e \u003cspan class=\"va\"\u003eself\u003c/span\u003e.lm_head.weight  \u003cspan class=\"co\"\u003e# type:ignore\u003c/span\u003e\u003c/span\u003e\n\u003cspan id=\"cb7-23\"\u003e\u003ca href=\"#cb7-23\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e\u003c/span\u003e\n\u003cspan id=\"cb7-24\"\u003e\u003ca href=\"#cb7-24\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        \u003cspan class=\"co\"\u003e# init all weights\u003c/span\u003e\u003c/span\u003e\n\u003cspan id=\"cb7-25\"\u003e\u003ca href=\"#cb7-25\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        \u003cspan class=\"va\"\u003eself\u003c/span\u003e.\u003cspan class=\"bu\"\u003eapply\u003c/span\u003e(\u003cspan class=\"va\"\u003eself\u003c/span\u003e._init_weights)\u003c/span\u003e\n\u003cspan id=\"cb7-26\"\u003e\u003ca href=\"#cb7-26\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        \u003cspan class=\"co\"\u003e# apply special scaled init to the residual projections, per GPT-2\u003c/span\u003e\u003c/span\u003e\n\u003cspan id=\"cb7-27\"\u003e\u003ca href=\"#cb7-27\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        \u003cspan class=\"cf\"\u003efor\u003c/span\u003e pn, p \u003cspan class=\"kw\"\u003ein\u003c/span\u003e \u003cspan class=\"va\"\u003eself\u003c/span\u003e.named_parameters():\u003c/span\u003e\n\u003cspan 
id=\"cb7-28\"\u003e\u003ca href=\"#cb7-28\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e            \u003cspan class=\"cf\"\u003eif\u003c/span\u003e pn.endswith(\u003cspan class=\"st\"\u003e'c_proj.weight'\u003c/span\u003e):\u003c/span\u003e\n\u003cspan id=\"cb7-29\"\u003e\u003ca href=\"#cb7-29\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e                torch.nn.init.normal_(\u003c/span\u003e\n\u003cspan id=\"cb7-30\"\u003e\u003ca href=\"#cb7-30\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e                    p,\u003c/span\u003e\n\u003cspan id=\"cb7-31\"\u003e\u003ca href=\"#cb7-31\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e                    mean\u003cspan class=\"op\"\u003e=\u003c/span\u003e\u003cspan class=\"fl\"\u003e0.0\u003c/span\u003e,\u003c/span\u003e\n\u003cspan id=\"cb7-32\"\u003e\u003ca href=\"#cb7-32\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e                    std\u003cspan class=\"op\"\u003e=\u003c/span\u003e\u003cspan class=\"fl\"\u003e0.02\u003c/span\u003e\u003cspan class=\"op\"\u003e/\u003c/span\u003emath.sqrt(\u003cspan class=\"dv\"\u003e2\u003c/span\u003e \u003cspan class=\"op\"\u003e*\u003c/span\u003e config.n_layer)\u003c/span\u003e\n\u003cspan id=\"cb7-33\"\u003e\u003ca href=\"#cb7-33\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e                )\u003c/span\u003e\n\u003cspan id=\"cb7-34\"\u003e\u003ca href=\"#cb7-34\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e\u003c/span\u003e\n\u003cspan id=\"cb7-35\"\u003e\u003ca href=\"#cb7-35\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        \u003cspan class=\"co\"\u003e# report number of parameters\u003c/span\u003e\u003c/span\u003e\n\u003cspan id=\"cb7-36\"\u003e\u003ca href=\"#cb7-36\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        log.info(\u003cspan class=\"st\"\u003e\"number of parameters: \u003c/span\u003e\u003cspan class=\"sc\"\u003e%.2f\u003c/span\u003e\u003cspan 
class=\"st\"\u003eM\"\u003c/span\u003e \u003cspan class=\"op\"\u003e%\u003c/span\u003e (\u003cspan class=\"va\"\u003eself\u003c/span\u003e.get_num_params()\u003cspan class=\"op\"\u003e/\u003c/span\u003e\u003cspan class=\"fl\"\u003e1e6\u003c/span\u003e,))\u003c/span\u003e\n\u003cspan id=\"cb7-37\"\u003e\u003ca href=\"#cb7-37\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e\u003c/span\u003e\n\u003cspan id=\"cb7-38\"\u003e\u003ca href=\"#cb7-38\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e    \u003cspan class=\"kw\"\u003edef\u003c/span\u003e get_num_params(\u003cspan class=\"va\"\u003eself\u003c/span\u003e, non_embedding\u003cspan class=\"op\"\u003e=\u003c/span\u003e\u003cspan class=\"va\"\u003eTrue\u003c/span\u003e):\u003c/span\u003e\n\u003cspan id=\"cb7-39\"\u003e\u003ca href=\"#cb7-39\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        \u003cspan class=\"co\"\u003e\"\"\"\u003c/span\u003e\u003c/span\u003e\n\u003cspan id=\"cb7-40\"\u003e\u003ca href=\"#cb7-40\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e\u003cspan class=\"co\"\u003e        Return the number of parameters in the model.\u003c/span\u003e\u003c/span\u003e\n\u003cspan id=\"cb7-41\"\u003e\u003ca href=\"#cb7-41\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e\u003cspan class=\"co\"\u003e        For non-embedding count (default), the position embeddings get\u003c/span\u003e\u003c/span\u003e\n\u003cspan id=\"cb7-42\"\u003e\u003ca href=\"#cb7-42\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e\u003cspan class=\"co\"\u003e        subtracted.\u003c/span\u003e\u003c/span\u003e\n\u003cspan id=\"cb7-43\"\u003e\u003ca href=\"#cb7-43\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e\u003c/span\u003e\n\u003cspan id=\"cb7-44\"\u003e\u003ca href=\"#cb7-44\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e\u003cspan class=\"co\"\u003e        The token embeddings would too, except due to the parameter 
sharing\u003c/span\u003e\u003c/span\u003e\n\u003cspan id=\"cb7-45\"\u003e\u003ca href=\"#cb7-45\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e\u003cspan class=\"co\"\u003e        these params are actually used as weights in the final layer, so we\u003c/span\u003e\u003c/span\u003e\n\u003cspan id=\"cb7-46\"\u003e\u003ca href=\"#cb7-46\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e\u003cspan class=\"co\"\u003e        include them.\u003c/span\u003e\u003c/span\u003e\n\u003cspan id=\"cb7-47\"\u003e\u003ca href=\"#cb7-47\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e\u003cspan class=\"co\"\u003e        \"\"\"\u003c/span\u003e\u003c/span\u003e\n\u003cspan id=\"cb7-48\"\u003e\u003ca href=\"#cb7-48\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        n_params \u003cspan class=\"op\"\u003e=\u003c/span\u003e \u003cspan class=\"bu\"\u003esum\u003c/span\u003e(p.numel() \u003cspan class=\"cf\"\u003efor\u003c/span\u003e p \u003cspan class=\"kw\"\u003ein\u003c/span\u003e \u003cspan class=\"va\"\u003eself\u003c/span\u003e.parameters())\u003c/span\u003e\n\u003cspan id=\"cb7-49\"\u003e\u003ca href=\"#cb7-49\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        \u003cspan class=\"cf\"\u003eif\u003c/span\u003e non_embedding:\u003c/span\u003e\n\u003cspan id=\"cb7-50\"\u003e\u003ca href=\"#cb7-50\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e            n_params \u003cspan class=\"op\"\u003e-=\u003c/span\u003e \u003cspan class=\"va\"\u003eself\u003c/span\u003e.transformer.wpe.weight.numel()  \u003cspan class=\"co\"\u003e# type:ignore\u003c/span\u003e\u003c/span\u003e\n\u003cspan id=\"cb7-51\"\u003e\u003ca href=\"#cb7-51\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        \u003cspan class=\"cf\"\u003ereturn\u003c/span\u003e n_params\u003c/span\u003e\n\u003cspan id=\"cb7-52\"\u003e\u003ca href=\"#cb7-52\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e\u003c/span\u003e\n\u003cspan 
id=\"cb7-53\"\u003e\u003ca href=\"#cb7-53\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e    \u003cspan class=\"kw\"\u003edef\u003c/span\u003e _init_weights(\u003cspan class=\"va\"\u003eself\u003c/span\u003e, module):\u003c/span\u003e\n\u003cspan id=\"cb7-54\"\u003e\u003ca href=\"#cb7-54\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        \u003cspan class=\"cf\"\u003eif\u003c/span\u003e \u003cspan class=\"bu\"\u003eisinstance\u003c/span\u003e(module, nn.Linear):\u003c/span\u003e\n\u003cspan id=\"cb7-55\"\u003e\u003ca href=\"#cb7-55\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e            torch.nn.init.normal_(module.weight, mean\u003cspan class=\"op\"\u003e=\u003c/span\u003e\u003cspan class=\"fl\"\u003e0.0\u003c/span\u003e, std\u003cspan class=\"op\"\u003e=\u003c/span\u003e\u003cspan class=\"fl\"\u003e0.02\u003c/span\u003e)\u003c/span\u003e\n\u003cspan id=\"cb7-56\"\u003e\u003ca href=\"#cb7-56\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e            \u003cspan class=\"cf\"\u003eif\u003c/span\u003e module.bias \u003cspan class=\"kw\"\u003eis\u003c/span\u003e \u003cspan class=\"kw\"\u003enot\u003c/span\u003e \u003cspan class=\"va\"\u003eNone\u003c/span\u003e:\u003c/span\u003e\n\u003cspan id=\"cb7-57\"\u003e\u003ca href=\"#cb7-57\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e                torch.nn.init.zeros_(module.bias)\u003c/span\u003e\n\u003cspan id=\"cb7-58\"\u003e\u003ca href=\"#cb7-58\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        \u003cspan class=\"cf\"\u003eelif\u003c/span\u003e \u003cspan class=\"bu\"\u003eisinstance\u003c/span\u003e(module, nn.Embedding):\u003c/span\u003e\n\u003cspan id=\"cb7-59\"\u003e\u003ca href=\"#cb7-59\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e            torch.nn.init.normal_(module.weight, mean\u003cspan class=\"op\"\u003e=\u003c/span\u003e\u003cspan class=\"fl\"\u003e0.0\u003c/span\u003e, std\u003cspan 
class=\"op\"\u003e=\u003c/span\u003e\u003cspan class=\"fl\"\u003e0.02\u003c/span\u003e)\u003c/span\u003e\n\u003cspan id=\"cb7-60\"\u003e\u003ca href=\"#cb7-60\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e\u003c/span\u003e\n\u003cspan id=\"cb7-61\"\u003e\u003ca href=\"#cb7-61\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e    \u003cspan class=\"kw\"\u003edef\u003c/span\u003e forward(\u003cspan class=\"va\"\u003eself\u003c/span\u003e, idx, targets\u003cspan class=\"op\"\u003e=\u003c/span\u003e\u003cspan class=\"va\"\u003eNone\u003c/span\u003e):\u003c/span\u003e\n\u003cspan id=\"cb7-62\"\u003e\u003ca href=\"#cb7-62\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        device \u003cspan class=\"op\"\u003e=\u003c/span\u003e idx.device\u003c/span\u003e\n\u003cspan id=\"cb7-63\"\u003e\u003ca href=\"#cb7-63\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        b, t \u003cspan class=\"op\"\u003e=\u003c/span\u003e idx.size()\u003c/span\u003e\n\u003cspan id=\"cb7-64\"\u003e\u003ca href=\"#cb7-64\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        \u003cspan class=\"cf\"\u003eassert\u003c/span\u003e t \u003cspan class=\"op\"\u003e\u0026lt;=\u003c/span\u003e \u003cspan class=\"va\"\u003eself\u003c/span\u003e.config.block_size, (\u003c/span\u003e\n\u003cspan id=\"cb7-65\"\u003e\u003ca href=\"#cb7-65\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e            \u003cspan class=\"ss\"\u003ef\"Cannot forward sequence of length \u003c/span\u003e\u003cspan class=\"sc\"\u003e{\u003c/span\u003et\u003cspan class=\"sc\"\u003e}\u003c/span\u003e\u003cspan class=\"ss\"\u003e, \"\u003c/span\u003e\u003c/span\u003e\n\u003cspan id=\"cb7-66\"\u003e\u003ca href=\"#cb7-66\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e            \u003cspan class=\"st\"\u003ef\"block size is only \u003c/span\u003e\u003cspan class=\"sc\"\u003e{self.config.block_size}\u003c/span\u003e\u003cspan 
class=\"st\"\u003e\"\u003c/span\u003e\u003c/span\u003e\n\u003cspan id=\"cb7-67\"\u003e\u003ca href=\"#cb7-67\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        )\u003c/span\u003e\n\u003cspan id=\"cb7-68\"\u003e\u003ca href=\"#cb7-68\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        pos \u003cspan class=\"op\"\u003e=\u003c/span\u003e torch.arange(\u003c/span\u003e\n\u003cspan id=\"cb7-69\"\u003e\u003ca href=\"#cb7-69\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e            \u003cspan class=\"dv\"\u003e0\u003c/span\u003e,\u003c/span\u003e\n\u003cspan id=\"cb7-70\"\u003e\u003ca href=\"#cb7-70\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e            t,\u003c/span\u003e\n\u003cspan id=\"cb7-71\"\u003e\u003ca href=\"#cb7-71\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e            dtype\u003cspan class=\"op\"\u003e=\u003c/span\u003etorch.\u003cspan class=\"bu\"\u003elong\u003c/span\u003e,\u003c/span\u003e\n\u003cspan id=\"cb7-72\"\u003e\u003ca href=\"#cb7-72\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e            device\u003cspan class=\"op\"\u003e=\u003c/span\u003edevice\u003c/span\u003e\n\u003cspan id=\"cb7-73\"\u003e\u003ca href=\"#cb7-73\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        )  \u003cspan class=\"co\"\u003e# shape (t)\u003c/span\u003e\u003c/span\u003e\n\u003cspan id=\"cb7-74\"\u003e\u003ca href=\"#cb7-74\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e\u003c/span\u003e\n\u003cspan id=\"cb7-75\"\u003e\u003ca href=\"#cb7-75\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        \u003cspan class=\"co\"\u003e# forward the GPT model itself\u003c/span\u003e\u003c/span\u003e\n\u003cspan id=\"cb7-76\"\u003e\u003ca href=\"#cb7-76\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        \u003cspan class=\"co\"\u003e# token embeddings of shape (b, t, n_embd)\u003c/span\u003e\u003c/span\u003e\n\u003cspan id=\"cb7-77\"\u003e\u003ca 
href=\"#cb7-77\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        tok_emb \u003cspan class=\"op\"\u003e=\u003c/span\u003e \u003cspan class=\"va\"\u003eself\u003c/span\u003e.transformer.wte(idx)  \u003cspan class=\"co\"\u003e# type:ignore\u003c/span\u003e\u003c/span\u003e\n\u003cspan id=\"cb7-78\"\u003e\u003ca href=\"#cb7-78\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        \u003cspan class=\"co\"\u003e# position embeddings of shape (t, n_embd)\u003c/span\u003e\u003c/span\u003e\n\u003cspan id=\"cb7-79\"\u003e\u003ca href=\"#cb7-79\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        pos_emb \u003cspan class=\"op\"\u003e=\u003c/span\u003e \u003cspan class=\"va\"\u003eself\u003c/span\u003e.transformer.wpe(pos)  \u003cspan class=\"co\"\u003e# type:ignore\u003c/span\u003e\u003c/span\u003e\n\u003cspan id=\"cb7-80\"\u003e\u003ca href=\"#cb7-80\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        x \u003cspan class=\"op\"\u003e=\u003c/span\u003e \u003cspan class=\"va\"\u003eself\u003c/span\u003e.transformer.drop(tok_emb \u003cspan class=\"op\"\u003e+\u003c/span\u003e pos_emb)  \u003cspan class=\"co\"\u003e# type:ignore\u003c/span\u003e\u003c/span\u003e\n\u003cspan id=\"cb7-81\"\u003e\u003ca href=\"#cb7-81\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        \u003cspan class=\"cf\"\u003efor\u003c/span\u003e block \u003cspan class=\"kw\"\u003ein\u003c/span\u003e \u003cspan class=\"va\"\u003eself\u003c/span\u003e.transformer.h:  \u003cspan class=\"co\"\u003e# type:ignore\u003c/span\u003e\u003c/span\u003e\n\u003cspan id=\"cb7-82\"\u003e\u003ca href=\"#cb7-82\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e            x \u003cspan class=\"op\"\u003e=\u003c/span\u003e block(x)\u003c/span\u003e\n\u003cspan id=\"cb7-83\"\u003e\u003ca href=\"#cb7-83\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        x \u003cspan class=\"op\"\u003e=\u003c/span\u003e \u003cspan 
class=\"va\"\u003eself\u003c/span\u003e.transformer.ln_f(x)  \u003cspan class=\"co\"\u003e# type:ignore\u003c/span\u003e\u003c/span\u003e\n\u003cspan id=\"cb7-84\"\u003e\u003ca href=\"#cb7-84\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        \u003cspan class=\"cf\"\u003eif\u003c/span\u003e targets \u003cspan class=\"kw\"\u003eis\u003c/span\u003e \u003cspan class=\"kw\"\u003enot\u003c/span\u003e \u003cspan class=\"va\"\u003eNone\u003c/span\u003e:\u003c/span\u003e\n\u003cspan id=\"cb7-85\"\u003e\u003ca href=\"#cb7-85\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e            \u003cspan class=\"co\"\u003e# if we are given some desired targets also calculate the loss\u003c/span\u003e\u003c/span\u003e\n\u003cspan id=\"cb7-86\"\u003e\u003ca href=\"#cb7-86\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e            logits \u003cspan class=\"op\"\u003e=\u003c/span\u003e \u003cspan class=\"va\"\u003eself\u003c/span\u003e.lm_head(x)\u003c/span\u003e\n\u003cspan id=\"cb7-87\"\u003e\u003ca href=\"#cb7-87\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e            loss \u003cspan class=\"op\"\u003e=\u003c/span\u003e F.cross_entropy(\u003c/span\u003e\n\u003cspan id=\"cb7-88\"\u003e\u003ca href=\"#cb7-88\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e                logits.view(\u003c/span\u003e\n\u003cspan id=\"cb7-89\"\u003e\u003ca href=\"#cb7-89\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e                    \u003cspan class=\"op\"\u003e-\u003c/span\u003e\u003cspan class=\"dv\"\u003e1\u003c/span\u003e,\u003c/span\u003e\n\u003cspan id=\"cb7-90\"\u003e\u003ca href=\"#cb7-90\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e                    logits.size(\u003cspan class=\"op\"\u003e-\u003c/span\u003e\u003cspan class=\"dv\"\u003e1\u003c/span\u003e)\u003c/span\u003e\n\u003cspan id=\"cb7-91\"\u003e\u003ca href=\"#cb7-91\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e                
),\u003c/span\u003e\n\u003cspan id=\"cb7-92\"\u003e\u003ca href=\"#cb7-92\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e                targets.view(\u003cspan class=\"op\"\u003e-\u003c/span\u003e\u003cspan class=\"dv\"\u003e1\u003c/span\u003e),\u003c/span\u003e\n\u003cspan id=\"cb7-93\"\u003e\u003ca href=\"#cb7-93\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e                ignore_index\u003cspan class=\"op\"\u003e=-\u003c/span\u003e\u003cspan class=\"dv\"\u003e1\u003c/span\u003e\u003c/span\u003e\n\u003cspan id=\"cb7-94\"\u003e\u003ca href=\"#cb7-94\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e            )\u003c/span\u003e\n\u003cspan id=\"cb7-95\"\u003e\u003ca href=\"#cb7-95\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        \u003cspan class=\"cf\"\u003eelse\u003c/span\u003e:\u003c/span\u003e\n\u003cspan id=\"cb7-96\"\u003e\u003ca href=\"#cb7-96\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e            \u003cspan class=\"co\"\u003e# inference-time mini-optimization: only forward the lm_head on the\u003c/span\u003e\u003c/span\u003e\n\u003cspan id=\"cb7-97\"\u003e\u003ca href=\"#cb7-97\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e            \u003cspan class=\"co\"\u003e# very last position\u003c/span\u003e\u003c/span\u003e\n\u003cspan id=\"cb7-98\"\u003e\u003ca href=\"#cb7-98\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e            \u003cspan class=\"co\"\u003e# note: using list [-1] to preserve the time dim\u003c/span\u003e\u003c/span\u003e\n\u003cspan id=\"cb7-99\"\u003e\u003ca href=\"#cb7-99\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e            logits \u003cspan class=\"op\"\u003e=\u003c/span\u003e \u003cspan class=\"va\"\u003eself\u003c/span\u003e.lm_head(x[:, [\u003cspan class=\"op\"\u003e-\u003c/span\u003e\u003cspan class=\"dv\"\u003e1\u003c/span\u003e], :])\u003c/span\u003e\n\u003cspan id=\"cb7-100\"\u003e\u003ca href=\"#cb7-100\" aria-hidden=\"true\" 
tabindex=\"-1\"\u003e\u003c/a\u003e            loss \u003cspan class=\"op\"\u003e=\u003c/span\u003e \u003cspan class=\"va\"\u003eNone\u003c/span\u003e\u003c/span\u003e\n\u003cspan id=\"cb7-101\"\u003e\u003ca href=\"#cb7-101\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e\u003c/span\u003e\n\u003cspan id=\"cb7-102\"\u003e\u003ca href=\"#cb7-102\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        \u003cspan class=\"cf\"\u003ereturn\u003c/span\u003e logits, loss\u003c/span\u003e\n\u003cspan id=\"cb7-103\"\u003e\u003ca href=\"#cb7-103\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e\u003c/span\u003e\n\u003cspan id=\"cb7-104\"\u003e\u003ca href=\"#cb7-104\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e    \u003cspan class=\"kw\"\u003edef\u003c/span\u003e crop_block_size(\u003cspan class=\"va\"\u003eself\u003c/span\u003e, block_size):\u003c/span\u003e\n\u003cspan id=\"cb7-105\"\u003e\u003ca href=\"#cb7-105\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        \u003cspan class=\"co\"\u003e# model surgery to decrease the block size if necessary e.g. 
we may\u003c/span\u003e\u003c/span\u003e\n\u003cspan id=\"cb7-106\"\u003e\u003ca href=\"#cb7-106\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        \u003cspan class=\"co\"\u003e# load the GPT2 pretrained model checkpoint (block size 1024) but want\u003c/span\u003e\u003c/span\u003e\n\u003cspan id=\"cb7-107\"\u003e\u003ca href=\"#cb7-107\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        \u003cspan class=\"co\"\u003e# to use a smaller block size for some smaller, simpler model\u003c/span\u003e\u003c/span\u003e\n\u003cspan id=\"cb7-108\"\u003e\u003ca href=\"#cb7-108\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        \u003cspan class=\"cf\"\u003eassert\u003c/span\u003e block_size \u003cspan class=\"op\"\u003e\u0026lt;=\u003c/span\u003e \u003cspan class=\"va\"\u003eself\u003c/span\u003e.config.block_size\u003c/span\u003e\n\u003cspan id=\"cb7-109\"\u003e\u003ca href=\"#cb7-109\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        \u003cspan class=\"va\"\u003eself\u003c/span\u003e.config.block_size \u003cspan class=\"op\"\u003e=\u003c/span\u003e block_size\u003c/span\u003e\n\u003cspan id=\"cb7-110\"\u003e\u003ca href=\"#cb7-110\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        \u003cspan class=\"va\"\u003eself\u003c/span\u003e.transformer.wpe.weight \u003cspan class=\"op\"\u003e=\u003c/span\u003e (  \u003cspan class=\"co\"\u003e# type:ignore\u003c/span\u003e\u003c/span\u003e\n\u003cspan id=\"cb7-111\"\u003e\u003ca href=\"#cb7-111\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e            nn.Parameter(\u003c/span\u003e\n\u003cspan id=\"cb7-112\"\u003e\u003ca href=\"#cb7-112\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e                \u003cspan class=\"va\"\u003eself\u003c/span\u003e.transformer.wpe.weight[:block_size]  \u003cspan class=\"co\"\u003e# type:ignore\u003c/span\u003e\u003c/span\u003e\n\u003cspan id=\"cb7-113\"\u003e\u003ca href=\"#cb7-113\" aria-hidden=\"true\" 
tabindex=\"-1\"\u003e\u003c/a\u003e            )\u003c/span\u003e\n\u003cspan id=\"cb7-114\"\u003e\u003ca href=\"#cb7-114\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        )\u003c/span\u003e\n\u003cspan id=\"cb7-115\"\u003e\u003ca href=\"#cb7-115\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        \u003cspan class=\"cf\"\u003efor\u003c/span\u003e block \u003cspan class=\"kw\"\u003ein\u003c/span\u003e \u003cspan class=\"va\"\u003eself\u003c/span\u003e.transformer.h:   \u003cspan class=\"co\"\u003e# type:ignore\u003c/span\u003e\u003c/span\u003e\n\u003cspan id=\"cb7-116\"\u003e\u003ca href=\"#cb7-116\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e            \u003cspan class=\"cf\"\u003eif\u003c/span\u003e \u003cspan class=\"bu\"\u003ehasattr\u003c/span\u003e(block.attn, \u003cspan class=\"st\"\u003e'bias'\u003c/span\u003e):\u003c/span\u003e\n\u003cspan id=\"cb7-117\"\u003e\u003ca href=\"#cb7-117\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e                block.attn.bias \u003cspan class=\"op\"\u003e=\u003c/span\u003e (\u003c/span\u003e\n\u003cspan id=\"cb7-118\"\u003e\u003ca href=\"#cb7-118\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e                    block.attn.bias[:, :, :block_size, :block_size]\u003c/span\u003e\n\u003cspan id=\"cb7-119\"\u003e\u003ca href=\"#cb7-119\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e                )\u003c/span\u003e\n\u003cspan id=\"cb7-120\"\u003e\u003ca href=\"#cb7-120\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e\u003c/span\u003e\n\u003cspan id=\"cb7-121\"\u003e\u003ca href=\"#cb7-121\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e    \u003cspan class=\"at\"\u003e@classmethod\u003c/span\u003e\u003c/span\u003e\n\u003cspan id=\"cb7-122\"\u003e\u003ca href=\"#cb7-122\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e    \u003cspan class=\"kw\"\u003edef\u003c/span\u003e from_pretrained(cls, model_type, 
override_args\u003cspan class=\"op\"\u003e=\u003c/span\u003e\u003cspan class=\"va\"\u003eNone\u003c/span\u003e):\u003c/span\u003e\n\u003cspan id=\"cb7-123\"\u003e\u003ca href=\"#cb7-123\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        \u003cspan class=\"cf\"\u003eassert\u003c/span\u003e model_type \u003cspan class=\"kw\"\u003ein\u003c/span\u003e {\u003cspan class=\"st\"\u003e'gpt2'\u003c/span\u003e, \u003cspan class=\"st\"\u003e'gpt2-medium'\u003c/span\u003e, \u003cspan class=\"st\"\u003e'gpt2-large'\u003c/span\u003e, \u003cspan class=\"st\"\u003e'gpt2-xl'\u003c/span\u003e}\u003c/span\u003e\n\u003cspan id=\"cb7-124\"\u003e\u003ca href=\"#cb7-124\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        override_args \u003cspan class=\"op\"\u003e=\u003c/span\u003e override_args \u003cspan class=\"kw\"\u003eor\u003c/span\u003e {}  \u003cspan class=\"co\"\u003e# default to empty dict\u003c/span\u003e\u003c/span\u003e\n\u003cspan id=\"cb7-125\"\u003e\u003ca href=\"#cb7-125\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        \u003cspan class=\"co\"\u003e# only dropout can be overridden see more notes below\u003c/span\u003e\u003c/span\u003e\n\u003cspan id=\"cb7-126\"\u003e\u003ca href=\"#cb7-126\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        \u003cspan class=\"cf\"\u003eassert\u003c/span\u003e \u003cspan class=\"bu\"\u003eall\u003c/span\u003e(k \u003cspan class=\"op\"\u003e==\u003c/span\u003e \u003cspan class=\"st\"\u003e'dropout'\u003c/span\u003e \u003cspan class=\"cf\"\u003efor\u003c/span\u003e k \u003cspan class=\"kw\"\u003ein\u003c/span\u003e override_args)\u003c/span\u003e\n\u003cspan id=\"cb7-127\"\u003e\u003ca href=\"#cb7-127\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        \u003cspan class=\"im\"\u003efrom\u003c/span\u003e transformers \u003cspan class=\"im\"\u003eimport\u003c/span\u003e GPT2LMHeadModel\u003c/span\u003e\n\u003cspan id=\"cb7-128\"\u003e\u003ca href=\"#cb7-128\" 
aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        log.info(\u003cspan class=\"ss\"\u003ef\"loading weights from pretrained gpt: \u003c/span\u003e\u003cspan class=\"sc\"\u003e{\u003c/span\u003emodel_type\u003cspan class=\"op\"\u003e=\u003c/span\u003e\u003cspan class=\"sc\"\u003e}\u003c/span\u003e\u003cspan class=\"ss\"\u003e\"\u003c/span\u003e)\u003c/span\u003e\n\u003cspan id=\"cb7-129\"\u003e\u003ca href=\"#cb7-129\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        \u003cspan class=\"co\"\u003e# n_layer, n_head and n_embd are determined from model_type\u003c/span\u003e\u003c/span\u003e\n\u003cspan id=\"cb7-130\"\u003e\u003ca href=\"#cb7-130\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        \u003cspan class=\"co\"\u003e# gpt2: 124M params\u003c/span\u003e\u003c/span\u003e\n\u003cspan id=\"cb7-131\"\u003e\u003ca href=\"#cb7-131\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        \u003cspan class=\"co\"\u003e# gpt2-medium: 350M params\u003c/span\u003e\u003c/span\u003e\n\u003cspan id=\"cb7-132\"\u003e\u003ca href=\"#cb7-132\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        \u003cspan class=\"co\"\u003e# gpt2-large: 774M params\u003c/span\u003e\u003c/span\u003e\n\u003cspan id=\"cb7-133\"\u003e\u003ca href=\"#cb7-133\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        \u003cspan class=\"co\"\u003e# gpt2-xl: 1558M params\u003c/span\u003e\u003c/span\u003e\n\u003cspan id=\"cb7-134\"\u003e\u003ca href=\"#cb7-134\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        config_args \u003cspan class=\"op\"\u003e=\u003c/span\u003e {\u003c/span\u003e\n\u003cspan id=\"cb7-135\"\u003e\u003ca href=\"#cb7-135\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e            \u003cspan class=\"co\"\u003e# 'baby-llama2': dict(n_layer=16, n_head=16, n_embed=1024),\u003c/span\u003e\u003c/span\u003e\n\u003cspan id=\"cb7-136\"\u003e\u003ca href=\"#cb7-136\" aria-hidden=\"true\" 
tabindex=\"-1\"\u003e\u003c/a\u003e            \u003cspan class=\"co\"\u003e# 'llama2-7b': dict(n_layer=32, n_head=32, n_embd=4096),\u003c/span\u003e\u003c/span\u003e\n\u003cspan id=\"cb7-137\"\u003e\u003ca href=\"#cb7-137\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e            \u003cspan class=\"st\"\u003e'gpt2'\u003c/span\u003e: \u003cspan class=\"bu\"\u003edict\u003c/span\u003e(n_layer\u003cspan class=\"op\"\u003e=\u003c/span\u003e\u003cspan class=\"dv\"\u003e12\u003c/span\u003e, n_head\u003cspan class=\"op\"\u003e=\u003c/span\u003e\u003cspan class=\"dv\"\u003e12\u003c/span\u003e, n_embd\u003cspan class=\"op\"\u003e=\u003c/span\u003e\u003cspan class=\"dv\"\u003e768\u003c/span\u003e),\u003c/span\u003e\n\u003cspan id=\"cb7-138\"\u003e\u003ca href=\"#cb7-138\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e            \u003cspan class=\"st\"\u003e'gpt2-medium'\u003c/span\u003e: \u003cspan class=\"bu\"\u003edict\u003c/span\u003e(n_layer\u003cspan class=\"op\"\u003e=\u003c/span\u003e\u003cspan class=\"dv\"\u003e24\u003c/span\u003e, n_head\u003cspan class=\"op\"\u003e=\u003c/span\u003e\u003cspan class=\"dv\"\u003e16\u003c/span\u003e, n_embd\u003cspan class=\"op\"\u003e=\u003c/span\u003e\u003cspan class=\"dv\"\u003e1024\u003c/span\u003e),\u003c/span\u003e\n\u003cspan id=\"cb7-139\"\u003e\u003ca href=\"#cb7-139\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e            \u003cspan class=\"st\"\u003e'gpt2-large'\u003c/span\u003e: \u003cspan class=\"bu\"\u003edict\u003c/span\u003e(n_layer\u003cspan class=\"op\"\u003e=\u003c/span\u003e\u003cspan class=\"dv\"\u003e36\u003c/span\u003e, n_head\u003cspan class=\"op\"\u003e=\u003c/span\u003e\u003cspan class=\"dv\"\u003e20\u003c/span\u003e, n_embd\u003cspan class=\"op\"\u003e=\u003c/span\u003e\u003cspan class=\"dv\"\u003e1280\u003c/span\u003e),\u003c/span\u003e\n\u003cspan id=\"cb7-140\"\u003e\u003ca href=\"#cb7-140\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e            \u003cspan 
class=\"st\"\u003e'gpt2-xl'\u003c/span\u003e: \u003cspan class=\"bu\"\u003edict\u003c/span\u003e(n_layer\u003cspan class=\"op\"\u003e=\u003c/span\u003e\u003cspan class=\"dv\"\u003e48\u003c/span\u003e, n_head\u003cspan class=\"op\"\u003e=\u003c/span\u003e\u003cspan class=\"dv\"\u003e25\u003c/span\u003e, n_embd\u003cspan class=\"op\"\u003e=\u003c/span\u003e\u003cspan class=\"dv\"\u003e1600\u003c/span\u003e),\u003c/span\u003e\n\u003cspan id=\"cb7-141\"\u003e\u003ca href=\"#cb7-141\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        }[model_type]\u003c/span\u003e\n\u003cspan id=\"cb7-142\"\u003e\u003ca href=\"#cb7-142\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        \u003cspan class=\"co\"\u003e# we can override the dropout rate, if desired\u003c/span\u003e\u003c/span\u003e\n\u003cspan id=\"cb7-143\"\u003e\u003ca href=\"#cb7-143\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        \u003cspan class=\"cf\"\u003eif\u003c/span\u003e \u003cspan class=\"st\"\u003e'dropout'\u003c/span\u003e \u003cspan class=\"kw\"\u003ein\u003c/span\u003e override_args:\u003c/span\u003e\n\u003cspan id=\"cb7-144\"\u003e\u003ca href=\"#cb7-144\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e            log.info(\u003cspan class=\"ss\"\u003ef\"overriding dropout rate to \u003c/span\u003e\u003cspan class=\"sc\"\u003e{\u003c/span\u003eoverride_args[\u003cspan class=\"st\"\u003e'dropout'\u003c/span\u003e]\u003cspan class=\"sc\"\u003e}\u003c/span\u003e\u003cspan class=\"ss\"\u003e\"\u003c/span\u003e)\u003c/span\u003e\n\u003cspan id=\"cb7-145\"\u003e\u003ca href=\"#cb7-145\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e            config_args[\u003cspan class=\"st\"\u003e'dropout'\u003c/span\u003e] \u003cspan class=\"op\"\u003e=\u003c/span\u003e override_args[\u003cspan class=\"st\"\u003e'dropout'\u003c/span\u003e]\u003c/span\u003e\n\u003cspan id=\"cb7-146\"\u003e\u003ca href=\"#cb7-146\" aria-hidden=\"true\" 
tabindex=\"-1\"\u003e\u003c/a\u003e        \u003cspan class=\"co\"\u003e# create a from-scratch initialized minGPT model\u003c/span\u003e\u003c/span\u003e\n\u003cspan id=\"cb7-147\"\u003e\u003ca href=\"#cb7-147\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        log.info(\u003cspan class=\"st\"\u003e\"forcing vocab_size=50257, block_size=1024, bias=True\"\u003c/span\u003e)\u003c/span\u003e\n\u003cspan id=\"cb7-148\"\u003e\u003ca href=\"#cb7-148\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        config \u003cspan class=\"op\"\u003e=\u003c/span\u003e GPTModelConfig(\u003c/span\u003e\n\u003cspan id=\"cb7-149\"\u003e\u003ca href=\"#cb7-149\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e            \u003cspan class=\"op\"\u003e**\u003c/span\u003econfig_args,\u003c/span\u003e\n\u003cspan id=\"cb7-150\"\u003e\u003ca href=\"#cb7-150\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e            block_size\u003cspan class=\"op\"\u003e=\u003c/span\u003e\u003cspan class=\"dv\"\u003e1024\u003c/span\u003e,   \u003cspan class=\"co\"\u003e# always 1024 for GPT model checkpoints\u003c/span\u003e\u003c/span\u003e\n\u003cspan id=\"cb7-151\"\u003e\u003ca href=\"#cb7-151\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e            vocab_size\u003cspan class=\"op\"\u003e=\u003c/span\u003e\u003cspan class=\"dv\"\u003e50257\u003c/span\u003e,  \u003cspan class=\"co\"\u003e# always 50257 for GPT model checkpoints\u003c/span\u003e\u003c/span\u003e\n\u003cspan id=\"cb7-152\"\u003e\u003ca href=\"#cb7-152\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e            bias\u003cspan class=\"op\"\u003e=\u003c/span\u003e\u003cspan class=\"va\"\u003eTrue\u003c/span\u003e,         \u003cspan class=\"co\"\u003e# always True for GPT model checkpoints\u003c/span\u003e\u003c/span\u003e\n\u003cspan id=\"cb7-153\"\u003e\u003ca href=\"#cb7-153\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        )\u003c/span\u003e\n\u003cspan 
id=\"cb7-154\"\u003e\u003ca href=\"#cb7-154\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        model \u003cspan class=\"op\"\u003e=\u003c/span\u003e GPT(config)\u003c/span\u003e\n\u003cspan id=\"cb7-155\"\u003e\u003ca href=\"#cb7-155\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        sd \u003cspan class=\"op\"\u003e=\u003c/span\u003e model.state_dict()\u003c/span\u003e\n\u003cspan id=\"cb7-156\"\u003e\u003ca href=\"#cb7-156\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        sd_keys \u003cspan class=\"op\"\u003e=\u003c/span\u003e sd.keys()\u003c/span\u003e\n\u003cspan id=\"cb7-157\"\u003e\u003ca href=\"#cb7-157\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        sd_keys \u003cspan class=\"op\"\u003e=\u003c/span\u003e [\u003c/span\u003e\n\u003cspan id=\"cb7-158\"\u003e\u003ca href=\"#cb7-158\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e            k \u003cspan class=\"cf\"\u003efor\u003c/span\u003e k \u003cspan class=\"kw\"\u003ein\u003c/span\u003e sd_keys \u003cspan class=\"cf\"\u003eif\u003c/span\u003e \u003cspan class=\"kw\"\u003enot\u003c/span\u003e k.endswith(\u003cspan class=\"st\"\u003e'.attn.bias'\u003c/span\u003e)\u003c/span\u003e\n\u003cspan id=\"cb7-159\"\u003e\u003ca href=\"#cb7-159\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        ]  \u003cspan class=\"co\"\u003e# discard this mask / buffer, not a param\u003c/span\u003e\u003c/span\u003e\n\u003cspan id=\"cb7-160\"\u003e\u003ca href=\"#cb7-160\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e\u003c/span\u003e\n\u003cspan id=\"cb7-161\"\u003e\u003ca href=\"#cb7-161\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        \u003cspan class=\"co\"\u003e# init a huggingface/transformers model\u003c/span\u003e\u003c/span\u003e\n\u003cspan id=\"cb7-162\"\u003e\u003ca href=\"#cb7-162\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        model_hf \u003cspan class=\"op\"\u003e=\u003c/span\u003e 
GPT2LMHeadModel.from_pretrained(model_type)\u003c/span\u003e\n\u003cspan id=\"cb7-163\"\u003e\u003ca href=\"#cb7-163\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        sd_hf \u003cspan class=\"op\"\u003e=\u003c/span\u003e model_hf.state_dict()\u003c/span\u003e\n\u003cspan id=\"cb7-164\"\u003e\u003ca href=\"#cb7-164\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e\u003c/span\u003e\n\u003cspan id=\"cb7-165\"\u003e\u003ca href=\"#cb7-165\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        \u003cspan class=\"co\"\u003e# copy while ensuring all of the parameters are aligned and match in\u003c/span\u003e\u003c/span\u003e\n\u003cspan id=\"cb7-166\"\u003e\u003ca href=\"#cb7-166\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        \u003cspan class=\"co\"\u003e# names and shapes\u003c/span\u003e\u003c/span\u003e\n\u003cspan id=\"cb7-167\"\u003e\u003ca href=\"#cb7-167\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        sd_keys_hf \u003cspan class=\"op\"\u003e=\u003c/span\u003e sd_hf.keys()\u003c/span\u003e\n\u003cspan id=\"cb7-168\"\u003e\u003ca href=\"#cb7-168\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        sd_keys_hf \u003cspan class=\"op\"\u003e=\u003c/span\u003e [\u003c/span\u003e\n\u003cspan id=\"cb7-169\"\u003e\u003ca href=\"#cb7-169\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e            k \u003cspan class=\"cf\"\u003efor\u003c/span\u003e k \u003cspan class=\"kw\"\u003ein\u003c/span\u003e sd_keys_hf \u003cspan class=\"cf\"\u003eif\u003c/span\u003e \u003cspan class=\"kw\"\u003enot\u003c/span\u003e k.endswith(\u003cspan class=\"st\"\u003e'.attn.masked_bias'\u003c/span\u003e)\u003c/span\u003e\n\u003cspan id=\"cb7-170\"\u003e\u003ca href=\"#cb7-170\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        ]  \u003cspan class=\"co\"\u003e# ignore these, just a buffer\u003c/span\u003e\u003c/span\u003e\n\u003cspan id=\"cb7-171\"\u003e\u003ca href=\"#cb7-171\" 
aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        sd_keys_hf \u003cspan class=\"op\"\u003e=\u003c/span\u003e [\u003c/span\u003e\n\u003cspan id=\"cb7-172\"\u003e\u003ca href=\"#cb7-172\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e            k \u003cspan class=\"cf\"\u003efor\u003c/span\u003e k \u003cspan class=\"kw\"\u003ein\u003c/span\u003e sd_keys_hf \u003cspan class=\"cf\"\u003eif\u003c/span\u003e \u003cspan class=\"kw\"\u003enot\u003c/span\u003e k.endswith(\u003cspan class=\"st\"\u003e'.attn.bias'\u003c/span\u003e)\u003c/span\u003e\n\u003cspan id=\"cb7-173\"\u003e\u003ca href=\"#cb7-173\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        ]  \u003cspan class=\"co\"\u003e# same, just the mask (buffer)\u003c/span\u003e\u003c/span\u003e\n\u003cspan id=\"cb7-174\"\u003e\u003ca href=\"#cb7-174\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        transposed \u003cspan class=\"op\"\u003e=\u003c/span\u003e [\u003c/span\u003e\n\u003cspan id=\"cb7-175\"\u003e\u003ca href=\"#cb7-175\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e            \u003cspan class=\"st\"\u003e'attn.c_attn.weight'\u003c/span\u003e,\u003c/span\u003e\n\u003cspan id=\"cb7-176\"\u003e\u003ca href=\"#cb7-176\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e            \u003cspan class=\"st\"\u003e'attn.c_proj.weight'\u003c/span\u003e,\u003c/span\u003e\n\u003cspan id=\"cb7-177\"\u003e\u003ca href=\"#cb7-177\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e            \u003cspan class=\"st\"\u003e'mlp.c_fc.weight'\u003c/span\u003e,\u003c/span\u003e\n\u003cspan id=\"cb7-178\"\u003e\u003ca href=\"#cb7-178\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e            \u003cspan class=\"st\"\u003e'mlp.c_proj.weight'\u003c/span\u003e\u003c/span\u003e\n\u003cspan id=\"cb7-179\"\u003e\u003ca href=\"#cb7-179\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        ]\u003c/span\u003e\n\u003cspan 
id=\"cb7-180\"\u003e\u003ca href=\"#cb7-180\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        \u003cspan class=\"co\"\u003e# basically the openai checkpoints use a \"Conv1D\" module, but we only\u003c/span\u003e\u003c/span\u003e\n\u003cspan id=\"cb7-181\"\u003e\u003ca href=\"#cb7-181\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        \u003cspan class=\"co\"\u003e# want to use a vanilla Linear. This means that we have to transpose\u003c/span\u003e\u003c/span\u003e\n\u003cspan id=\"cb7-182\"\u003e\u003ca href=\"#cb7-182\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        \u003cspan class=\"co\"\u003e# these weights when we import them\u003c/span\u003e\u003c/span\u003e\n\u003cspan id=\"cb7-183\"\u003e\u003ca href=\"#cb7-183\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        \u003cspan class=\"cf\"\u003eassert\u003c/span\u003e \u003cspan class=\"bu\"\u003elen\u003c/span\u003e(sd_keys_hf) \u003cspan class=\"op\"\u003e==\u003c/span\u003e \u003cspan class=\"bu\"\u003elen\u003c/span\u003e(sd_keys), (\u003c/span\u003e\n\u003cspan id=\"cb7-184\"\u003e\u003ca href=\"#cb7-184\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e            \u003cspan class=\"ss\"\u003ef\"mismatched keys: \u003c/span\u003e\u003cspan class=\"sc\"\u003e{\u003c/span\u003e\u003cspan class=\"bu\"\u003elen\u003c/span\u003e(sd_keys_hf)\u003cspan class=\"sc\"\u003e}\u003c/span\u003e\u003cspan class=\"ss\"\u003e != \u003c/span\u003e\u003cspan class=\"sc\"\u003e{\u003c/span\u003e\u003cspan class=\"bu\"\u003elen\u003c/span\u003e(sd_keys)\u003cspan class=\"sc\"\u003e}\u003c/span\u003e\u003cspan class=\"ss\"\u003e\"\u003c/span\u003e\u003c/span\u003e\n\u003cspan id=\"cb7-185\"\u003e\u003ca href=\"#cb7-185\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        )\u003c/span\u003e\n\u003cspan id=\"cb7-186\"\u003e\u003ca href=\"#cb7-186\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        \u003cspan 
class=\"cf\"\u003efor\u003c/span\u003e k \u003cspan class=\"kw\"\u003ein\u003c/span\u003e sd_keys_hf:\u003c/span\u003e\n\u003cspan id=\"cb7-187\"\u003e\u003ca href=\"#cb7-187\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e            \u003cspan class=\"cf\"\u003eif\u003c/span\u003e \u003cspan class=\"bu\"\u003eany\u003c/span\u003e(k.endswith(w) \u003cspan class=\"cf\"\u003efor\u003c/span\u003e w \u003cspan class=\"kw\"\u003ein\u003c/span\u003e transposed):\u003c/span\u003e\n\u003cspan id=\"cb7-188\"\u003e\u003ca href=\"#cb7-188\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e                \u003cspan class=\"co\"\u003e# special treatment for the Conv1D weights we need to transpose\u003c/span\u003e\u003c/span\u003e\n\u003cspan id=\"cb7-189\"\u003e\u003ca href=\"#cb7-189\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e                \u003cspan class=\"cf\"\u003eassert\u003c/span\u003e sd_hf[k].shape[::\u003cspan class=\"op\"\u003e-\u003c/span\u003e\u003cspan class=\"dv\"\u003e1\u003c/span\u003e] \u003cspan class=\"op\"\u003e==\u003c/span\u003e sd[k].shape\u003c/span\u003e\n\u003cspan id=\"cb7-190\"\u003e\u003ca href=\"#cb7-190\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e                \u003cspan class=\"cf\"\u003ewith\u003c/span\u003e torch.no_grad():\u003c/span\u003e\n\u003cspan id=\"cb7-191\"\u003e\u003ca href=\"#cb7-191\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e                    sd[k].copy_(sd_hf[k].t())\u003c/span\u003e\n\u003cspan id=\"cb7-192\"\u003e\u003ca href=\"#cb7-192\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e            \u003cspan class=\"cf\"\u003eelse\u003c/span\u003e:\u003c/span\u003e\n\u003cspan id=\"cb7-193\"\u003e\u003ca href=\"#cb7-193\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e                \u003cspan class=\"co\"\u003e# vanilla copy over the other parameters\u003c/span\u003e\u003c/span\u003e\n\u003cspan id=\"cb7-194\"\u003e\u003ca href=\"#cb7-194\" 
aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e                \u003cspan class=\"cf\"\u003eassert\u003c/span\u003e sd_hf[k].shape \u003cspan class=\"op\"\u003e==\u003c/span\u003e sd[k].shape\u003c/span\u003e\n\u003cspan id=\"cb7-195\"\u003e\u003ca href=\"#cb7-195\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e                \u003cspan class=\"cf\"\u003ewith\u003c/span\u003e torch.no_grad():\u003c/span\u003e\n\u003cspan id=\"cb7-196\"\u003e\u003ca href=\"#cb7-196\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e                    sd[k].copy_(sd_hf[k])\u003c/span\u003e\n\u003cspan id=\"cb7-197\"\u003e\u003ca href=\"#cb7-197\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e\u003c/span\u003e\n\u003cspan id=\"cb7-198\"\u003e\u003ca href=\"#cb7-198\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        \u003cspan class=\"cf\"\u003ereturn\u003c/span\u003e model\u003c/span\u003e\n\u003cspan id=\"cb7-199\"\u003e\u003ca href=\"#cb7-199\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e\u003c/span\u003e\n\u003cspan id=\"cb7-200\"\u003e\u003ca href=\"#cb7-200\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e    \u003cspan class=\"kw\"\u003edef\u003c/span\u003e configure_optimizers(\u003c/span\u003e\n\u003cspan id=\"cb7-201\"\u003e\u003ca href=\"#cb7-201\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e            \u003cspan class=\"va\"\u003eself\u003c/span\u003e,\u003c/span\u003e\n\u003cspan id=\"cb7-202\"\u003e\u003ca href=\"#cb7-202\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e            weight_decay,\u003c/span\u003e\n\u003cspan id=\"cb7-203\"\u003e\u003ca href=\"#cb7-203\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e            learning_rate,\u003c/span\u003e\n\u003cspan id=\"cb7-204\"\u003e\u003ca href=\"#cb7-204\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e            betas,\u003c/span\u003e\n\u003cspan id=\"cb7-205\"\u003e\u003ca href=\"#cb7-205\" 
aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e            device_type\u003c/span\u003e\n\u003cspan id=\"cb7-206\"\u003e\u003ca href=\"#cb7-206\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e    ):\u003c/span\u003e\n\u003cspan id=\"cb7-207\"\u003e\u003ca href=\"#cb7-207\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        \u003cspan class=\"co\"\u003e# start with all of the candidate parameters\u003c/span\u003e\u003c/span\u003e\n\u003cspan id=\"cb7-208\"\u003e\u003ca href=\"#cb7-208\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        \u003cspan class=\"co\"\u003e# filter out those that do not require grad\u003c/span\u003e\u003c/span\u003e\n\u003cspan id=\"cb7-209\"\u003e\u003ca href=\"#cb7-209\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        \u003cspan class=\"co\"\u003e# param_dict = {\u003c/span\u003e\u003c/span\u003e\n\u003cspan id=\"cb7-210\"\u003e\u003ca href=\"#cb7-210\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        \u003cspan class=\"co\"\u003e#     pn: p for pn, p in param_dict.items() if p.requires_grad\u003c/span\u003e\u003c/span\u003e\n\u003cspan id=\"cb7-211\"\u003e\u003ca href=\"#cb7-211\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        \u003cspan class=\"co\"\u003e# }\u003c/span\u003e\u003c/span\u003e\n\u003cspan id=\"cb7-212\"\u003e\u003ca href=\"#cb7-212\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        param_dict \u003cspan class=\"op\"\u003e=\u003c/span\u003e {\u003c/span\u003e\n\u003cspan id=\"cb7-213\"\u003e\u003ca href=\"#cb7-213\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e            pn: p \u003cspan class=\"cf\"\u003efor\u003c/span\u003e pn, p \u003cspan class=\"kw\"\u003ein\u003c/span\u003e \u003cspan class=\"va\"\u003eself\u003c/span\u003e.named_parameters() \u003cspan class=\"cf\"\u003eif\u003c/span\u003e p.requires_grad\u003c/span\u003e\n\u003cspan id=\"cb7-214\"\u003e\u003ca href=\"#cb7-214\" 
aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        }\u003c/span\u003e\n\u003cspan id=\"cb7-215\"\u003e\u003ca href=\"#cb7-215\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        \u003cspan class=\"co\"\u003e# create optim groups. Any parameter that is at least 2D will be weight\u003c/span\u003e\u003c/span\u003e\n\u003cspan id=\"cb7-216\"\u003e\u003ca href=\"#cb7-216\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        \u003cspan class=\"co\"\u003e# decayed, otherwise not; i.e. all weight tensors in matmuls +\u003c/span\u003e\u003c/span\u003e\n\u003cspan id=\"cb7-217\"\u003e\u003ca href=\"#cb7-217\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        \u003cspan class=\"co\"\u003e# embeddings decay, all biases and layernorms don't.\u003c/span\u003e\u003c/span\u003e\n\u003cspan id=\"cb7-218\"\u003e\u003ca href=\"#cb7-218\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        decay_params \u003cspan class=\"op\"\u003e=\u003c/span\u003e [p \u003cspan class=\"cf\"\u003efor\u003c/span\u003e _, p \u003cspan class=\"kw\"\u003ein\u003c/span\u003e param_dict.items() \u003cspan class=\"cf\"\u003eif\u003c/span\u003e p.dim() \u003cspan class=\"op\"\u003e\u0026gt;=\u003c/span\u003e \u003cspan class=\"dv\"\u003e2\u003c/span\u003e]\u003c/span\u003e\n\u003cspan id=\"cb7-219\"\u003e\u003ca href=\"#cb7-219\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        nodecay_params \u003cspan class=\"op\"\u003e=\u003c/span\u003e [p \u003cspan class=\"cf\"\u003efor\u003c/span\u003e _, p \u003cspan class=\"kw\"\u003ein\u003c/span\u003e param_dict.items() \u003cspan class=\"cf\"\u003eif\u003c/span\u003e p.dim() \u003cspan class=\"op\"\u003e\u0026lt;\u003c/span\u003e \u003cspan class=\"dv\"\u003e2\u003c/span\u003e]\u003c/span\u003e\n\u003cspan id=\"cb7-220\"\u003e\u003ca href=\"#cb7-220\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        optim_groups \u003cspan class=\"op\"\u003e=\u003c/span\u003e 
[\u003c/span\u003e\n\u003cspan id=\"cb7-221\"\u003e\u003ca href=\"#cb7-221\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e            {\u003cspan class=\"st\"\u003e'params'\u003c/span\u003e: decay_params, \u003cspan class=\"st\"\u003e'weight_decay'\u003c/span\u003e: weight_decay},\u003c/span\u003e\n\u003cspan id=\"cb7-222\"\u003e\u003ca href=\"#cb7-222\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e            {\u003cspan class=\"st\"\u003e'params'\u003c/span\u003e: nodecay_params, \u003cspan class=\"st\"\u003e'weight_decay'\u003c/span\u003e: \u003cspan class=\"fl\"\u003e0.0\u003c/span\u003e}\u003c/span\u003e\n\u003cspan id=\"cb7-223\"\u003e\u003ca href=\"#cb7-223\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        ]\u003c/span\u003e\n\u003cspan id=\"cb7-224\"\u003e\u003ca href=\"#cb7-224\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        num_decay_params \u003cspan class=\"op\"\u003e=\u003c/span\u003e \u003cspan class=\"bu\"\u003esum\u003c/span\u003e(p.numel() \u003cspan class=\"cf\"\u003efor\u003c/span\u003e p \u003cspan class=\"kw\"\u003ein\u003c/span\u003e decay_params)\u003c/span\u003e\n\u003cspan id=\"cb7-225\"\u003e\u003ca href=\"#cb7-225\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        num_nodecay_params \u003cspan class=\"op\"\u003e=\u003c/span\u003e \u003cspan class=\"bu\"\u003esum\u003c/span\u003e(p.numel() \u003cspan class=\"cf\"\u003efor\u003c/span\u003e p \u003cspan class=\"kw\"\u003ein\u003c/span\u003e nodecay_params)\u003c/span\u003e\n\u003cspan id=\"cb7-226\"\u003e\u003ca href=\"#cb7-226\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        log.info(\u003c/span\u003e\n\u003cspan id=\"cb7-227\"\u003e\u003ca href=\"#cb7-227\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e            \u003cspan class=\"ss\"\u003ef\"num decayed parameter tensors: \u003c/span\u003e\u003cspan class=\"sc\"\u003e{\u003c/span\u003e\u003cspan 
class=\"bu\"\u003elen\u003c/span\u003e(decay_params)\u003cspan class=\"sc\"\u003e}\u003c/span\u003e\u003cspan class=\"ss\"\u003e, \"\u003c/span\u003e\u003c/span\u003e\n\u003cspan id=\"cb7-228\"\u003e\u003ca href=\"#cb7-228\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e            \u003cspan class=\"ss\"\u003ef\"with \u003c/span\u003e\u003cspan class=\"sc\"\u003e{\u003c/span\u003enum_decay_params\u003cspan class=\"sc\"\u003e:,}\u003c/span\u003e\u003cspan class=\"ss\"\u003e parameters\"\u003c/span\u003e\u003c/span\u003e\n\u003cspan id=\"cb7-229\"\u003e\u003ca href=\"#cb7-229\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        )\u003c/span\u003e\n\u003cspan id=\"cb7-230\"\u003e\u003ca href=\"#cb7-230\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        log.info(\u003c/span\u003e\n\u003cspan id=\"cb7-231\"\u003e\u003ca href=\"#cb7-231\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e            \u003cspan class=\"ss\"\u003ef\"num non-decayed parameter tensors: \u003c/span\u003e\u003cspan class=\"sc\"\u003e{\u003c/span\u003e\u003cspan class=\"bu\"\u003elen\u003c/span\u003e(nodecay_params)\u003cspan class=\"sc\"\u003e}\u003c/span\u003e\u003cspan class=\"ss\"\u003e, \"\u003c/span\u003e\u003c/span\u003e\n\u003cspan id=\"cb7-232\"\u003e\u003ca href=\"#cb7-232\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e            \u003cspan class=\"ss\"\u003ef\"with \u003c/span\u003e\u003cspan class=\"sc\"\u003e{\u003c/span\u003enum_nodecay_params\u003cspan class=\"sc\"\u003e:,}\u003c/span\u003e\u003cspan class=\"ss\"\u003e parameters\"\u003c/span\u003e\u003c/span\u003e\n\u003cspan id=\"cb7-233\"\u003e\u003ca href=\"#cb7-233\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        )\u003c/span\u003e\n\u003cspan id=\"cb7-234\"\u003e\u003ca href=\"#cb7-234\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        \u003cspan class=\"co\"\u003e# Create AdamW optimizer and use the fused version if it is 
available\u003c/span\u003e\u003c/span\u003e\n\u003cspan id=\"cb7-235\"\u003e\u003ca href=\"#cb7-235\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        fused_available \u003cspan class=\"op\"\u003e=\u003c/span\u003e (\u003c/span\u003e\n\u003cspan id=\"cb7-236\"\u003e\u003ca href=\"#cb7-236\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e            \u003cspan class=\"st\"\u003e'fused'\u003c/span\u003e \u003cspan class=\"kw\"\u003ein\u003c/span\u003e inspect.signature(torch.optim.AdamW).parameters\u003c/span\u003e\n\u003cspan id=\"cb7-237\"\u003e\u003ca href=\"#cb7-237\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        )\u003c/span\u003e\n\u003cspan id=\"cb7-238\"\u003e\u003ca href=\"#cb7-238\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        use_fused \u003cspan class=\"op\"\u003e=\u003c/span\u003e fused_available \u003cspan class=\"kw\"\u003eand\u003c/span\u003e device_type \u003cspan class=\"op\"\u003e==\u003c/span\u003e \u003cspan class=\"st\"\u003e'cuda'\u003c/span\u003e\u003c/span\u003e\n\u003cspan id=\"cb7-239\"\u003e\u003ca href=\"#cb7-239\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        extra_args \u003cspan class=\"op\"\u003e=\u003c/span\u003e \u003cspan class=\"bu\"\u003edict\u003c/span\u003e(fused\u003cspan class=\"op\"\u003e=\u003c/span\u003e\u003cspan class=\"va\"\u003eTrue\u003c/span\u003e) \u003cspan class=\"cf\"\u003eif\u003c/span\u003e use_fused \u003cspan class=\"cf\"\u003eelse\u003c/span\u003e {}\u003c/span\u003e\n\u003cspan id=\"cb7-240\"\u003e\u003ca href=\"#cb7-240\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        optimizer \u003cspan class=\"op\"\u003e=\u003c/span\u003e torch.optim.AdamW(\u003c/span\u003e\n\u003cspan id=\"cb7-241\"\u003e\u003ca href=\"#cb7-241\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e            optim_groups,\u003c/span\u003e\n\u003cspan id=\"cb7-242\"\u003e\u003ca href=\"#cb7-242\" aria-hidden=\"true\" 
tabindex=\"-1\"\u003e\u003c/a\u003e            lr\u003cspan class=\"op\"\u003e=\u003c/span\u003elearning_rate,\u003c/span\u003e\n\u003cspan id=\"cb7-243\"\u003e\u003ca href=\"#cb7-243\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e            betas\u003cspan class=\"op\"\u003e=\u003c/span\u003ebetas,\u003c/span\u003e\n\u003cspan id=\"cb7-244\"\u003e\u003ca href=\"#cb7-244\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e            \u003cspan class=\"op\"\u003e**\u003c/span\u003eextra_args\u003c/span\u003e\n\u003cspan id=\"cb7-245\"\u003e\u003ca href=\"#cb7-245\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        )\u003c/span\u003e\n\u003cspan id=\"cb7-246\"\u003e\u003ca href=\"#cb7-246\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        log.info(\u003cspan class=\"ss\"\u003ef\"using fused AdamW: \u003c/span\u003e\u003cspan class=\"sc\"\u003e{\u003c/span\u003euse_fused\u003cspan class=\"sc\"\u003e}\u003c/span\u003e\u003cspan class=\"ss\"\u003e\"\u003c/span\u003e)\u003c/span\u003e\n\u003cspan id=\"cb7-247\"\u003e\u003ca href=\"#cb7-247\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e\u003c/span\u003e\n\u003cspan id=\"cb7-248\"\u003e\u003ca href=\"#cb7-248\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        \u003cspan class=\"cf\"\u003ereturn\u003c/span\u003e optimizer\u003c/span\u003e\n\u003cspan id=\"cb7-249\"\u003e\u003ca href=\"#cb7-249\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e\u003c/span\u003e\n\u003cspan id=\"cb7-250\"\u003e\u003ca href=\"#cb7-250\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e    \u003cspan class=\"kw\"\u003edef\u003c/span\u003e estimate_mfu(\u003cspan class=\"va\"\u003eself\u003c/span\u003e, fwdbwd_per_iter, dt):\u003c/span\u003e\n\u003cspan id=\"cb7-251\"\u003e\u003ca href=\"#cb7-251\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        \u003cspan class=\"co\"\u003e\"\"\"Estimate model flops utilization 
(MFU)\u003c/span\u003e\u003c/span\u003e\n\u003cspan id=\"cb7-252\"\u003e\u003ca href=\"#cb7-252\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e\u003c/span\u003e\n\u003cspan id=\"cb7-253\"\u003e\u003ca href=\"#cb7-253\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e\u003cspan class=\"co\"\u003e        (in units of A100 bfloat16 peak FLOPS)\u003c/span\u003e\u003c/span\u003e\n\u003cspan id=\"cb7-254\"\u003e\u003ca href=\"#cb7-254\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e\u003cspan class=\"co\"\u003e        \"\"\"\u003c/span\u003e\u003c/span\u003e\n\u003cspan id=\"cb7-255\"\u003e\u003ca href=\"#cb7-255\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        \u003cspan class=\"co\"\u003e# first estimate the number of flops we do per iteration.\u003c/span\u003e\u003c/span\u003e\n\u003cspan id=\"cb7-256\"\u003e\u003ca href=\"#cb7-256\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        \u003cspan class=\"co\"\u003e# see PaLM paper Appendix B as ref: https://arxiv.org/abs/2204.02311\u003c/span\u003e\u003c/span\u003e\n\u003cspan id=\"cb7-257\"\u003e\u003ca href=\"#cb7-257\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        N \u003cspan class=\"op\"\u003e=\u003c/span\u003e \u003cspan class=\"va\"\u003eself\u003c/span\u003e.get_num_params()\u003c/span\u003e\n\u003cspan id=\"cb7-258\"\u003e\u003ca href=\"#cb7-258\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        cfg \u003cspan class=\"op\"\u003e=\u003c/span\u003e \u003cspan class=\"va\"\u003eself\u003c/span\u003e.config\u003c/span\u003e\n\u003cspan id=\"cb7-259\"\u003e\u003ca href=\"#cb7-259\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        L, H, Q, T \u003cspan class=\"op\"\u003e=\u003c/span\u003e (\u003c/span\u003e\n\u003cspan id=\"cb7-260\"\u003e\u003ca href=\"#cb7-260\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e            cfg.n_layer,\u003c/span\u003e\n\u003cspan id=\"cb7-261\"\u003e\u003ca 
href=\"#cb7-261\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e            cfg.n_head,\u003c/span\u003e\n\u003cspan id=\"cb7-262\"\u003e\u003ca href=\"#cb7-262\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e            cfg.n_embd\u003cspan class=\"op\"\u003e//\u003c/span\u003ecfg.n_head,\u003c/span\u003e\n\u003cspan id=\"cb7-263\"\u003e\u003ca href=\"#cb7-263\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e            cfg.block_size\u003c/span\u003e\n\u003cspan id=\"cb7-264\"\u003e\u003ca href=\"#cb7-264\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        )\u003c/span\u003e\n\u003cspan id=\"cb7-265\"\u003e\u003ca href=\"#cb7-265\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        flops_per_token \u003cspan class=\"op\"\u003e=\u003c/span\u003e \u003cspan class=\"dv\"\u003e6\u003c/span\u003e\u003cspan class=\"op\"\u003e*\u003c/span\u003eN \u003cspan class=\"op\"\u003e+\u003c/span\u003e \u003cspan class=\"dv\"\u003e12\u003c/span\u003e\u003cspan class=\"op\"\u003e*\u003c/span\u003eL\u003cspan class=\"op\"\u003e*\u003c/span\u003eH\u003cspan class=\"op\"\u003e*\u003c/span\u003eQ\u003cspan class=\"op\"\u003e*\u003c/span\u003eT\u003c/span\u003e\n\u003cspan id=\"cb7-266\"\u003e\u003ca href=\"#cb7-266\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        flops_per_fwdbwd \u003cspan class=\"op\"\u003e=\u003c/span\u003e flops_per_token \u003cspan class=\"op\"\u003e*\u003c/span\u003e T\u003c/span\u003e\n\u003cspan id=\"cb7-267\"\u003e\u003ca href=\"#cb7-267\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        flops_per_iter \u003cspan class=\"op\"\u003e=\u003c/span\u003e flops_per_fwdbwd \u003cspan class=\"op\"\u003e*\u003c/span\u003e fwdbwd_per_iter\u003c/span\u003e\n\u003cspan id=\"cb7-268\"\u003e\u003ca href=\"#cb7-268\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        \u003cspan class=\"co\"\u003e# express our flops throughput as ratio of A100 bfloat16 peak 
flops\u003c/span\u003e\u003c/span\u003e\n\u003cspan id=\"cb7-269\"\u003e\u003ca href=\"#cb7-269\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        flops_achieved \u003cspan class=\"op\"\u003e=\u003c/span\u003e flops_per_iter \u003cspan class=\"op\"\u003e*\u003c/span\u003e (\u003cspan class=\"fl\"\u003e1.0\u003c/span\u003e\u003cspan class=\"op\"\u003e/\u003c/span\u003edt)  \u003cspan class=\"co\"\u003e# per second\u003c/span\u003e\u003c/span\u003e\n\u003cspan id=\"cb7-270\"\u003e\u003ca href=\"#cb7-270\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        flops_promised \u003cspan class=\"op\"\u003e=\u003c/span\u003e \u003cspan class=\"fl\"\u003e312e12\u003c/span\u003e  \u003cspan class=\"co\"\u003e# A100 GPU bfloat16 peak flops is 312 TFLOPS\u003c/span\u003e\u003c/span\u003e\n\u003cspan id=\"cb7-271\"\u003e\u003ca href=\"#cb7-271\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e        \u003cspan class=\"cf\"\u003ereturn\u003c/span\u003e flops_achieved \u003cspan class=\"op\"\u003e/\u003c/span\u003e flops_promised\u003c/span\u003e\n\u003cspan id=\"cb7-272\"\u003e\u003ca href=\"#cb7-272\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e\u003c/span\u003e\n\u003cspan id=\"cb7-273\"\u003e\u003ca href=\"#cb7-273\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e    \u003cspan class=\"at\"\u003e@torch.no_grad\u003c/span\u003e()\u003c/span\u003e\n\u003cspan id=\"cb7-274\"\u003e\u003ca href=\"#cb7-274\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e    \u003cspan class=\"kw\"\u003edef\u003c/span\u003e generate(\u003cspan class=\"va\"\u003eself\u003c/span\u003e, idx, max_new_tokens, temperature\u003cspan class=\"op\"\u003e=\u003c/span\u003e\u003cspan class=\"fl\"\u003e1.0\u003c/span\u003e, top_k\u003cspan class=\"op\"\u003e=\u003c/span\u003e\u003cspan class=\"va\"\u003eNone\u003c/span\u003e):\u003c/span\u003e\n\u003cspan id=\"cb7-275\"\u003e\u003ca href=\"#cb7-275\" aria-hidden=\"true\" 
tabindex=\"-1\"\u003e\u003c/a\u003e        \u003cspan class=\"co\"\u003e\"\"\"\u003c/span\u003e\u003c/span\u003e\n\u003cspan id=\"cb7-276\"\u003e\u003ca href=\"#cb7-276\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e\u003cspan class=\"co\"\u003e        Take a conditioning sequence of indices idx (LongTensor of shape (b,t))\u003c/span\u003e\u003c/span\u003e\n\u003cspan id=\"cb7-277\"\u003e\u003ca href=\"#cb7-277\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e\u003cspan class=\"co\"\u003e        and complete the sequence max_new_tokens times, feeding the predictions\u003c/span\u003e\u003c/span\u003e\n\u003cspan id=\"cb7-278\"\u003e\u003ca href=\"#cb7-278\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e\u003cspan class=\"co\"\u003e        back into the model each time.\u003c/span\u003e\u003c/span\u003e\n\u003cspan id=\"cb7-279\"\u003e\u003ca href=\"#cb7-279\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e\u003c/span\u003e\n\u003cspan id=\"cb7-280\"\u003e\u003ca href=\"#cb7-280\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e\u003cspan class=\"co\"\u003e        Most likely you'll want to make sure to be in model.eval() mode of\u003c/span\u003e\u003c/span\u003e\n\u003cspan id=\"cb7-281\"\u003e\u003ca href=\"#cb7-281\" aria-hidden=\"true\" tabindex=\"-1\"\u003e\u003c/a\u003e\u003cspan class=\"co\"\u003e        operation for this.\u003c/span\u003e\u003c/span\u003e\n\u003cspan id=\"cb7-282\"\u003e\u003ca href=\"#cb7-282\" aria-hidde","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsaforem2%2Fllm-workshop-talk","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsaforem2%2Fllm-workshop-talk","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsaforem2%2Fllm-workshop-talk/lists"}