awesome-transformer-nlp

A curated list of NLP resources focused on Transformer networks, attention mechanism, GPT, BERT, ChatGPT, LLMs, and transfer learning.
https://github.com/cedrickchee/awesome-transformer-nlp

Articles
- BERT and Transformer
  - What is XLNet and why it outperforms BERT
  - The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning)
  - Dissecting BERT - Understand BERT in depth with an intuitive, straightforward explanation of the relevant concepts.
  - A Light Introduction to Transformer-XL
  - Generalized Language Models
  - ALBERT: A Lite BERT for Self-supervised Learning of Language Representations paper - layer parameter sharing, and Sentence Order Prediction (SOP) loss to model inter-sentence coherence. [[Blog post](https://ai.googleblog.com/2019/12/albert-lite-bert-for-self-supervised.html) | [Code](https://github.com/google-research/ALBERT)]
  - ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators - Thang Luong, Quoc V. Le, and Christopher D. Manning - A BERT variant that, like ALBERT, costs less to train. They trained a model that outperforms GPT using only one GPU, and matched the performance of RoBERTa using 1/4 of the computation. It uses a new pre-training approach, called replaced token detection (RTD), that trains a bidirectional model while learning from all input positions. [[Blog post](https://ai.googleblog.com/2020/03/more-efficient-nlp-model-pre-training.html) | [Code](https://github.com/google-research/electra)]
  - Visual Paper Summary: ALBERT (A Lite BERT)
  - Cramming: Training a Language Model on a Single GPU in One Day (paper) - While most in the community are asking how to push the limits of extreme computation, we ask the opposite question: How far can we get with a single GPU in just one day? ... Through the lens of scaling laws, we categorize a range of recent improvements to training and architecture and discuss their merit and practical applicability (or lack thereof) for the limited compute setting.
  - Open Sourcing BERT: State-of-the-Art Pre-training for Natural Language Processing
  - What happened to BERT & T5? On Transformer Encoders, PrefixLM and Denoising Objectives
- Transformer Architecture
  - The Secret Sauce behind 100K context window in LLMs: all tricks in one place
  - DeepNet: Scaling Transformers to 1,000 Layers (paper) - The group introduced a **new normalization function (DEEPNORM)** to modify the residual connection in Transformer and showed that model updates can be bounded in a **stable way**. This improves the training stability of deep Transformers and scales the model depth by an order of magnitude (10x) compared to GPipe (pipeline parallelism) by Google Brain (2019). (Who remembers what ResNet (2015) did to ConvNet?)
  - A Length-Extrapolatable Transformer (paper) - This improves **modeling capability** of scaling Transformers.
  - LongNet: Scaling Transformers to 1,000,000,000 Tokens (paper)
  - vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention - The improved throughput comes from VRAM savings on an otherwise close to fully utilized GPU.
  - Unlimiformer: Long-Range Transformers with Unlimited Length Input (paper)
  - The Illustrated Transformer
  - Generative Modeling with Sparse Transformers - an algorithmic improvement of the attention mechanism to extract patterns from sequences 30x longer than possible previously.
  - Stabilizing Transformers for Reinforcement Learning - they propose architectural modifications to the original Transformer and XL variant: moving layer-norm and adding gating creates the Gated Transformer-XL (GTrXL). It substantially improves the stability and learning speed (integrating experience through time) in RL.
  - The Transformer Family - since the paper "Attention Is All You Need", many new things have happened to improve the Transformer model. This post is about that.
  - DETR (**DE**tection **TR**ansformer): End-to-End Object Detection with Transformers - :fire: Computer vision has not yet been swept up by the Transformer revolution. DETR completely changes the architecture compared with previous object detection systems. ([PyTorch Code and pretrained models](https://github.com/facebookresearch/detr)). "A solid swing at (non-autoregressive) end-to-end detection. Anchor boxes + Non-Max Suppression (NMS) is a mess. I was hoping detection would go end-to-end back in ~2013)" — Andrej Karpathy
  - Transformers for software engineers - This post will be helpful to software engineers who are interested in learning ML models, especially anyone interested in Transformer interpretability. The post walks through a (mostly) complete implementation of a GPT-style Transformer, but the goal will not be running code; instead, they use the language of software engineering and programming to explain how these models work and articulate some of the perspectives they bring to them when doing interpretability work.
  - Efficient Long Sequence Modeling via State Space Augmented Transformer (paper) - The quadratic computational cost of the attention mechanism limits its practicality for long sequences. There are existing attention variants that improve the computational efficiency, but they have limited ability to effectively compute global information. In parallel to Transformer models, state space models (SSMs) are tailored for long sequences, but they are not flexible enough to capture complicated local information. They propose SPADE, short for State sPace AugmenteD TransformEr, which outperforms various baselines, including Mega, on the Long Range Arena benchmark and various LM tasks. This is an interesting direction. SSMs and Transformers were combined a while back.
  - Hungry Hungry Hippos (H3): Towards Language Modeling with State Space Models (SSMs) (paper) - A new language modeling architecture. It **scales nearly linearly with context size instead of quadratically**. No more fixed context windows, long context for everyone. Despite that, SSMs are still slower than Transformers due to poor hardware utilization. So, a Transformer successor? [[Tweet](https://twitter.com/realDanFu/status/1617605971395891201)]
  - A Survey on Efficient Training of Transformers (paper) - The first systematic overview, covering 1) computation efficiency: optimization (e.g., sparse training) and data selection (e.g., token masking), 2) memory efficiency (e.g., data/model parallelism, offloading/use of external memory) and 3) hardware/algorithm co-design (e.g., efficient attention, hardware-aware low-precision).
  - Deep Transformers without Shortcuts: Modifying Self-attention for Faithful Signal Propagation (paper)
  - Hyena Hierarchy: Towards Larger Convolutional Language Models (paper) - Attention is great. Hyena is an alternative to attention that can learn on sequences **10x longer**, up to **100x faster** than optimized attention, by using implicit long convolutions and gating. [[Tweet](https://twitter.com/MichaelPoli6/status/1633167040130453505)]
  - Jump to Conclusions: Short-Cutting Transformers With Linear Transformations (paper)
  - CoLT5: Faster Long-Range Transformers with Conditional Computation (paper) - 64K context size for language models! This approach enables faster training and inference while maintaining or improving performance compared to LONGT5. The main components of COLT5 include routing modules, conditional feedforward layers, and conditional attention layers. Routing modules select important tokens for each input and component, while the light branches process all tokens with lower capacity operations, and heavy branches apply higher capacity operations only on selected important tokens. Additionally, COLT5 incorporates multi-query cross-attention for faster inference speed as well as UL2 pre-training objective for improved in-context learning capabilities over long inputs. [[Tweet](https://twitter.com/miolini/status/1637677536921657344)]
  - google-research/meliad - The Meliad library is a collection of models which are being developed as part of ongoing Google research into various architectural improvements in deep learning. The library currently consists of several transformer variations, which explore ways in which the popular transformer architecture can be extended to better support language modeling over long sequences. The variations are Memorizing Transformer, Transformer with sliding window, Block-Recurrent Transformer, and more.
  - Accelerating Large Language Model Decoding with Speculative Sampling (paper) - The speculative sampling algorithm enables the generation of multiple tokens from each transformer call. It achieves a 2–2.5x decoding speedup with Chinchilla in a distributed setup, without compromising the sample quality or making modifications to the model itself.
  - Łukasz Kaiser’s talk
  - Transformer-XL: Unleashing the Potential of Attention Models
  - Differential Transformer - It presents **significant improvements over standard Transformers** in multiple dimensions, with particular emphasis on attention efficiency and practical applications in LM tasks. A new architecture that improves attention mechanisms by reducing attention to irrelevant context. Achieves better performance while requiring fewer parameters and training tokens compared to standard Transformers. Solution: introduces a "differential attention" mechanism that calculates attention scores as the difference between two separate softmax attention maps. This subtraction cancels out noise. It can be implemented efficiently using existing FlashAttention. Scaling efficiency: **needs only ~65% of parameters or training tokens to match standard Transformer performance**. Improvements: 1) Better performance on long sequences up to 64K tokens. 2) Better at finding key information embedded in documents. 3) ICL: more robust to prompt order permutations. 4) Reduces attention misallocation, a primary cause of hallucinations. Technical details: includes headwise normalization to handle sparse attention patterns, etc. Future: development of efficient low-bit attention kernels, potential for compressing KV caches due to sparser attention patterns, etc. [Listen to [NotebookLM podcast](https://notebooklm.google.com/notebook/8e4c0907-8b12-4bc9-9d29-26c03daab71d/audio)]
  - Mixture-of-Depths (MoD): Dynamically allocating compute in transformer-based language models - The MoD method scales in the depth dimension while keeping the FLOPs constant (similar to how Mixture of Experts (MoE) does it in width). A MoD model can learn to route more complex tokens through more layers (similar to how experts in MoE can specialize to certain domains). The group explores how to optimize compute budget and improve efficiency without sacrificing performance. Results: MoD matches baseline performance with 66% faster training. Now, the question is, can it scale above 1B tokens? They tested on 500M tokens. [ELI5 version: [Mixture of Depths Meets Mixture of Experts](https://lifeinthesingularity.com/p/googles-breakthroughs-in-ai-design)]
  - Extending Context Window of Large Language Models via Positional Interpolation - Position Interpolation (PI) is an effective and efficient way to stably extend the context window of RoPE-based pretrained large language models such as LLaMA to much longer lengths (up to 32768) with minimal fine-tuning (within 1000 steps) while maintaining performance.
  - PoSE: Efficient Context Window Extension of LLMs via Positional Skip-wise Training - PoSE simulates longer input sequences during training by manipulating the position indices within a fixed context window, rather than training on the full target length. This allows decoupling of the training length from the target context length, greatly reducing memory and computational requirements compared to full-length fine-tuning. PoSE successfully extended LLaMA-1 to support context lengths up to 128k tokens using only a 2k training window, with minimal performance degradation. [This model](https://huggingface.co/winglian/Llama-3-8b-64k-PoSE) uses PoSE to extend Llama-3 8B context length from 8k to 64k. PoSE has potential to scale context lengths even further, limited only by inference memory, as efficient inference techniques continue improving. [Code: [PoSE](https://github.com/dwzhu-pku/PoSE)]
  - Better & Faster Large Language Models via Multi-token Prediction - What happens if we make language models predict several tokens ahead instead of only the next one? They show that replacing next token prediction tasks with multiple token prediction can result in substantially better code generation performance **with the exact same training budget and data — while also increasing inference performance by 3x**. While similar approaches have previously been used in fine-tuning to improve inference speed, **this research expands to pre-training for large models, showing notable behaviors and results at these scales**.
  - FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness (paper) - Transformers have grown deeper and wider, but training them on long sequences remains difficult. The attention layer at their heart is the compute and memory bottleneck: doubling the sequence length would quadruple the runtime and memory requirements. FlashAttention is a new algorithm to speed up attention and reduce its memory footprint—without any approximation. It enables training LLMs with longer context. [[code](https://github.com/HazyResearch/flash-attention)]
  - nGPT: Normalized Transformer with Representation Learning on the Hypersphere - A novel Transformer architecture where all vectors (embeddings, MLP, attention matrices, hidden states) are normalized to unit norm and operate on a hypersphere. Achieves 4-20x faster convergence during training compared to standard Transformers. Eliminates the need for weight decay by enforcing normalization. Normalization approach: matrix-vector multiplications become dot products bounded in [-1,1]. Architecture changes: 1) Attention mechanism - normalizes QKV projection matrices, introduces trainable scaling factors for Q-K dot products, 2) Layer structure: introduces learnable "eigen learning rates" (α) for attention and MLP blocks. Theoretical: can be interpreted in the context of Riemannian optimization. Advantages: more stable training, improved performance on downstream tasks, simplified architecture.
  - PaLM 2 Technical Report (PDF)
  - The Transformer blog post
  - Pathways Language Model (PaLM): Scaling to 540 Billion Parameters for Breakthrough Performance - PaLM is a dense decoder-only Transformer model trained with the Pathways system, which enabled Google to efficiently train a single model across multiple TPU v4 Pods. The example explaining a joke is remarkable. This shows that it can generate explicit explanations for scenarios that require a complex combination of multi-step logical inference, world knowledge, and deep language understanding.
- Large Language Model (LLM)
  - GPT-J-6B - Can't access GPT-3? Here's GPT-J — its open-source cousin.
  - Characterizing Attribution and Fluency Tradeoffs for Retrieval-Augmented Large Language Models (paper) - Despite recent progress, it has been difficult to prevent semantic hallucinations in generative LLMs. One common solution to this is augmenting LLMs with a retrieval system and making sure that the generated output is attributable to the retrieved information.
  - ChatGPT among the LLMs
  - Fun and Dystopia With AI-Based Code Generation Using GPT-J-6B - Prior to the GitHub Copilot tech preview launch, Max Woolf, a data scientist, tested GPT-J-6B's code "writing" abilities.
  - BigScience's BLOOM-176B - BLOOM is a 176-billion parameter model for language processing, able to generate text much like GPT-3 and OPT-175B. It was developed to be multilingual, being deliberately trained on datasets containing 46 natural languages and 13 programming languages.
  - bitsandbytes-Int8 inference for Hugging Face models - You can run BLOOM-176B/OPT-175B easily on a single machine, without performance degradation. If true, this could be a game changer in enabling people outside of big tech companies to use these LLMs.
  - WeLM: A Well-Read Pre-trained Language Model for Chinese (paper)
  - Teaching Small Language Models to Reason (paper) - They finetune a student model on the chain of thought (CoT) outputs generated by a larger teacher model. For example, the **accuracy of T5 XXL on GSM8K improves from 8.11% to 21.99%** when finetuned on PaLM-540B generated chains of thought.
  - ALERT: Adapting Language Models to Reasoning Tasks (paper) - They introduce ALERT, a benchmark and suite of analyses for assessing language models' reasoning ability comparing pre-trained and finetuned models on complex tasks that require reasoning skills to solve. It covers 10 different reasoning skills including logistic, causal, common-sense, abductive, spatial, analogical, argument and deductive reasoning as well as textual entailment, and mathematics.
  - Evaluating Human-Language Model Interaction (paper) - They find that better non-interactive performance does not always result in better human-LM interaction and that first-person and third-party metrics can diverge, suggesting the importance of examining the nuances of human-LM interaction.
  - Unnatural Instructions: Tuning Language Models with (Almost) No Human Labor (paper) - Fine-tuning a T5 on a large dataset collected with virtually no human labor leads to a model that surpasses the performance of models such as T0++ and Tk-Instruct across various benchmarks. These results demonstrate the potential of model-generated data as a **cost-effective alternative to crowdsourcing for dataset expansion and diversification**.
  - Rethinking with Retrieval: Faithful Large Language Model Inference (paper) - They show the potential of enhancing LLMs by retrieving relevant external knowledge based on decomposed reasoning steps obtained through chain-of-thought (CoT) prompting. I predict we're going to see many of these types of retrieval-enhanced LLMs in 2023.
  - Progressive Prompts: Continual Learning for Language Models (paper) - Current LLMs have a hard time with catastrophic forgetting and leveraging past experiences. The approach learns a prompt for each new task and concatenates it with the frozen, previously learned prompts. This efficiently transfers knowledge to future tasks. [[code](https://github.com/arazd/ProgressivePrompts)]
  - Large Language Models Can Be Easily Distracted by Irrelevant Context (paper) - Adding the instruction "Feel free to ignore irrelevant information given in the questions." consistently improves robustness to irrelevant context.
  - Toolformer: Language Models Can Teach Themselves to Use Tools (paper) - A smaller model trained to translate human intention into actions (i.e. decide which APIs to call, when to call them, what arguments to pass, and how to best incorporate the results into future token prediction).
  - ERNIE 3.0 Titan: Exploring Larger-scale Knowledge Enhanced Pre-training for Language Understanding and Generation (paper) - ERNIE 3.0 Titan is the latest addition to Baidu's ERNIE (Enhanced Representation through kNowledge IntEgration) family. It's inspired by the masking strategy of Google's BERT. ERNIE is also a unified framework. They also proposed a controllable learning algorithm and a credible learning algorithm. They apply online distillation technique to compress their model. To their knowledge, it is the largest (260B parameters) Chinese dense pre-trained model so far. [[article](http://research.baidu.com/Blog/index-view?id=165)]
  - Augmented Language Models (ALMs): a Survey (paper) - ALMs augment language models with reasoning skills and the ability to use various, non-parametric external modules for context processing, and outperform traditional LMs on several benchmarks. This new research direction has the potential to address interpretability, consistency and scalability issues.
  - A Comprehensive Survey on Pretrained Foundation Models: A History from BERT to ChatGPT (paper) - My remarks: this paper raises a lot of questions around the term "foundation models", i.e., what's the bare minimum number of parameters a model needs to qualify as a foundation model? It sounds to me like foundation models are an "invented" concept that doesn't have good validity.
  - Multimodal Chain-of-Thought Reasoning in Language Models (paper) - The model outperforms GPT-3.5 by 16% on the ScienceQA benchmark. This work is the first to study CoT reasoning in different modalities, language (text) and vision (images). Unfortunately, they never provide an ablation study on how much of that performance gain was caused by the new modalities. [[code](https://github.com/amazon-science/mm-cot)]
  - RECITE: Recitation-Augmented Language Models (paper) - How can ChatGPT-like models achieve greater factual accuracy without relying on an external retrieval search engine? This paper shows that recitation can help LLMs generate accurate factual knowledge by reciting relevant passages from their own memory (by sampling) before producing final answers. The core idea is motivated by the intuition that a recite step, which recollects relevant knowledge pieces, helps the answer step (generation) produce better output. That's a recite-answer paradigm: first ask the LLM to generate the support paragraphs that contain the answer (knowledge-recitation) and then use them as an additional prompt, along with the question, to ask the LLM to generate the answer. They verify the effectiveness on four LLMs. They also show that recitation can be more effective than retrieval. This is important since having a retriever may lead to unpredictable behavior (e.g., Bing/Sydney). [[code](https://github.com/Edward-Sun/RECITE)]
  - LLaMA: Open and Efficient Foundation Language Models (paper) - A collection of language models ranging from 7B to 65B parameters. LLaMA-13B outperforms GPT-3 (175B) on most benchmarks, and LLaMA-65B is competitive with the best models, Chinchilla-70B and PaLM-540B. This shows that **smaller models trained with more data can outperform larger models**. This is **not contradictory to anything in the Chinchilla paper**, because the LLaMA models are not trained compute-optimally. GPU hours for training: 7B model = 82,432, 65B model = 1,022,362 :scream:. Total time spent for all models: 2048 A100-80GB GPUs for a period of approximately 5 months. The 65B model cost something in the range of ~$1-4M. Access to the models is granted on a case-by-case basis; people interested can apply for access. (Mar 2: [they just approved access to the models](https://twitter.com/cedric_chee/status/1631182890418712578), llama-7B works in Colab [cedrickchee/llama](https://github.com/cedrickchee/llama/blob/main/notebooks/vi_LLaMA_alpha.ipynb)) [Takeaways: [Tweet](https://threadreaderapp.com/thread/1629496763148017665.html)]
  - Language Is Not All You Need: Aligning Perception with Language Models (paper) - They introduce KOSMOS-1, a Multimodal Large Language Model (MLLM) that can perceive general modalities, learn in context (i.e., few-shot), and follow instructions (i.e., zero-shot). The total number of parameters is about 1.6B.
  - Check Your Facts and Try Again: Improving Large Language Models with External Knowledge and Automated Feedback - LLM-Augmenter significantly reduces ChatGPT's hallucinations without sacrificing the fluency and informativeness of its responses. The architecture and data flow: 1) Retrieve evidence from external knowledge. 2) Consolidate the evidence with context and reasoning chains. 3) Give the result to the LLM (e.g., ChatGPT). 4) Check the response for hallucinations. 5) If the response hallucinates, give feedback and revise.
  - UL2: Unifying Language Learning Paradigms (paper) - UL2 is a unified framework for pretraining models that are universally effective across datasets and setups. _Takeaways: Objective matters way more than architecture. Mixture-of-Denoisers (MoD) is effective if you care about doing well on more than one type of task/setting._ UL2 frames different objective functions for training language models as denoising tasks, where the model has to recover missing sub-sequences of a given input. During pre-training it uses Mixture-of-Denoisers (MoD), which samples from a varied set of such objectives, each with a different configuration. MoD combines diverse pre-training paradigms. They demonstrated that models trained using the UL2 framework perform well in a variety of language domains, including prompt-based few-shot learning and fine-tuning for downstream tasks. They open sourced the UL2 20B model and checkpoints back in 2022. In 2023, they open sourced Flan-UL2 20B and released the weights. Check out: [[blog post](https://archive.is/20230303191656/https://www.yitay.net/blog/flan-ul2-20b), [Tweet](https://twitter.com/YiTayML/status/1631359474421366784)]. I'm excited to see what the community does with this new model.
  - Larger language models do in-context learning (ICL) differently (paper) - Overriding semantic priors when presented with enough flipped labels is an emergent ability of scale. Larger models are also better at learning input-label mappings when ICL labels are semantically unrelated to the inputs (e.g., apple/orange instead of negative/positive). Instruction tuning strengthens both the use of semantic priors and the ability to learn input-label mappings. [[Tweet](https://twitter.com/JerryWeiAI/status/1633548780619571200)]
  - The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset (paper) - Documents the data creation and curation efforts of Responsible Open-science Open-collaboration Text Source (ROOTS) corpus, a dataset used to train BLOOM. [[Tweet](https://twitter.com/arankomatsuzaki/status/1633282997020672000)]
  - Context-faithful Prompting for Large Language Models (paper)
  - Code Llama: Open Foundation Models for Code (paper) - Code Llama is a family of LLMs for code based on Llama 2, providing SoTA performance among open models, infilling capabilities, support for large input contexts, and zero-shot instruction-following ability for programming tasks. It can generate code, and natural language about code, from both code and natural language prompts. It's available in three variants: foundation models (Code Llama), Python specializations (Code Llama - Python), and instruction-following models (Code Llama - Instruct), with 7B, 13B, and 34B parameters each. It outperforms other publicly available LLMs on code benchmarks. They release it under the same permissive license (community license) as Llama 2.
  - GPT-J-6B - Can't access GPT-3? Here's GPT-J — its open-source cousin.
  - GPT-Code-Clippy (GPT-CC) - An open source version of GitHub Copilot. The GPT-CC models are fine-tuned versions of GPT-2 and GPT-Neo.
  - Metaseq - A codebase for working with [Open Pre-trained Transformers (OPT)](https://arxiv.org/abs/2205.01068).
  - YaLM 100B - A GPT-like pretrained language model with 100B parameters for generating and processing text. It can be used **freely** by developers and researchers from all over the world.
  - GLM-130B: An Open Bilingual (Chinese and English) Pre-Trained Model (code and paper) - One of the major contributions is making LLMs affordable to run by using INT4 quantization, so the model can run in limited compute environments.
  - jeffhj/LM-reasoning - This repository contains a collection of papers and resources on reasoning in Large Language Models.
  - GPT-NeoX-20B - A 20 billion parameter model trained using EleutherAI’s [GPT-NeoX](https://github.com/EleutherAI/gpt-neox) framework. They expect it to perform well on many tasks. You can try out the model on [GooseAI](https://goose.ai/) playground.
  - Introducing Llama 3: The most capable openly available LLM to date (article) - In the coming months, they’ll share the Llama 3 research paper. [[Code](https://github.com/meta-llama/llama3)]
  - Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context - The Gemini 1.5 model family technical report. Highlights: Gemini 1.5 Pro is now Google's most capable model (surpassing 1.0 Ultra), Gemini 1.5 models achieve near-perfect recall on long-context retrieval tasks across modalities, improvement in next-token prediction and near-perfect retrieval (>99%) up to at least 10M tokens.
  - REPLUG: Retrieval-Augmented Black-Box Language Models (paper) - TL;DR: Enhancing GPT-3 with world knowledge — a retrieval-augmented LM framework that combines a frozen LM with a frozen/tunable retriever. It improves GPT-3 in language modeling and downstream tasks by prepending retrieved documents to LM inputs. [[Tweet](https://twitter.com/WeijiaShi2/status/1620497381962977281)]
  - PanGu-Σ: Towards Trillion Parameter Language Model with Sparse Heterogeneous Computing (paper) - They develop a system that trained a trillion-parameter language model on a cluster of Ascend 910 AI processors using the MindSpore framework. The system achieved a 6.3x increase in training throughput through heterogeneous computing.
  - Llama 2: Open Foundation and Fine-Tuned Chat Models (paper) - Llama 2 pretrained models are trained on 2 trillion tokens and have double the context length of Llama 1. Its fine-tuned models have been trained on over 1 million human annotations. It outperforms other open source language models on many benchmarks. License: The model and weights are available for free for research and commercial use. It is not an open source model but rather an "open approach" model: for commercial use, your product cannot have more than 700 million monthly active users, and you must fill in a form to get access. Llama-2-chat is the new addition; it is created using supervised fine-tuning and then iteratively refined using RLHF. [[Nathan Lambert's summary of the paper](https://www.interconnects.ai/p/llama-2-from-meta)]
- Generative Pre-Training Transformer (GPT)
  - GitHub Copilot - Codex is a descendant of GPT-3. Codex translates natural language into code.
  - Better Language Models and Their Implications
  - Improving Language Understanding with Unsupervised Learning - this is an overview of the original OpenAI GPT model.
  - 🦄 How to build a State-of-the-Art Conversational AI with Transfer Learning
  - The Illustrated GPT-2 (Visualizing Transformer Language Models)
  - OpenGPT-2: We Replicated GPT-2 Because You Can Too - the authors trained a 1.5 billion parameter GPT-2 model on a similar sized text dataset and they reported results that can be compared with the original model.
  - MSBuild demo of an OpenAI generative text model generating Python code - The model was trained on GitHub OSS repos. It uses English-language code comments or simply function signatures to generate entire Python functions. Cool!
  - GPT-3: Language Models are Few-Shot Learners (paper) - "We train GPT-3, an autoregressive language model with 175 billion parameters :scream:, 10x more than any previous non-sparse language model, and test its performance in the few-shot setting."
  - How GPT3 Works - Visualizations and Animations
  - GPT-4 Rumors From Silicon Valley - GPT-4 is almost ready. GPT-4 would be multimodal, accepting text, audio, image, and possibly video inputs. Release window: Dec - Feb. #hype
  - MegatronLM: Training Billion+ Parameter Language Models Using GPU Model Parallelism
  - New GPT-3 model: text-Davinci-003 - Improvements:
  - GPT-4 research
  - A Comprehensive Capability Analysis of GPT-3 and GPT-3.5 Series Models
  - Sparks of Artificial General Intelligence: Early experiments with GPT-4 - There are completely mind-blowing examples in the paper.
  - GPT-3.5 series - tuning by playing both sides of the conversation.
  - People have noticed - 3 models.
  - elyase/awesome-gpt3 - A collection of demos and articles about the OpenAI GPT-3 API.
  - ChatGPT Universe
  - What is ChatGPT?
- Attention Concept
  - Making Transformer networks simpler and more efficient - FAIR released an all-attention layer to simplify the Transformer model and an adaptive attention span method to make it more efficient (reduce computation time and memory footprint).
- Attention Mechanism
  - Attention? Attention! - Attention guide by Lilian Weng from OpenAI.
  - Visualizing Attention, a Transformer's Heart
  - What Does BERT Look At? An Analysis of BERT’s Attention (paper)
  - Fast Transformer Decoding: One Write-Head is All You Need (paper) - They propose a variant of attention called **multi-query attention** (MQA). Plain multi-head attention has a separate query, key, and value projection per head; multi-query attention instead **shares one key and value across all of the different attention "heads"**. In practice, training time remains about the same, but **decoding at inference time is much faster**. MQA significantly improves the inference efficiency of language models: users can get ~10x better throughput and ~30% lower latency on inference. However, MQA can lead to quality degradation, and it may not be desirable to train a separate model just for faster inference. PaLM (2022), a decoder-style model, uses MQA, an interesting architectural improvement over GPT. Recent models that use MQA include [TII's Falcon](https://falconllm.tii.ae/) (2023).
  - GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints - They (1) propose a **technique for uptraining existing multi-head attention (MHA) models into models with multi-query attention (MQA)** using 5% of the original pre-training compute, and (2) introduce **grouped-query attention (GQA)**, a generalization of MQA that uses an intermediate number of key-value heads (more than one, fewer than the number of query heads). GQA achieves **quality close to MHA** with **inference speed comparable to MQA** through a reduced number of key-value heads. Models that use GQA include Meta's Llama 2 (2023). [[Some Tweets](https://twitter.com/_philschmid/status/1673335690912825347?s=20)]
  - Ring Attention with Blockwise Transformers for Near-Infinite Context - Ring Attention is a system-level optimization technique that leverages a specific hardware architecture to make exact attention computation more efficient.
  - Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention - Infini-attention adds a compressive memory with linear attention for processing infinitely long contexts. A 1B-parameter Transformer model fine-tuned on passkey instances of up to 5K sequence length solves the passkey retrieval problem at 1M-token input length. The Infini-attention mechanism presents an efficient and powerful approach for Transformer language models to process very long contexts without prohibitive increases in memory or computation.
  - Retrieval Head Mechanistically Explains Long-Context Factuality - The paper explains how LLMs actually deal with context windows. The finding: LLMs have unexpectedly developed retrieval heads, which were not explicitly coded for by their creators. [Code: [An algorithm that statistically calculates the retrieval score of attention heads in a transformer model](https://github.com/nightdessert/Retrieval_Head)]
  - Neural Machine Translation by Jointly Learning to Align and Translate - [Bahdanau invented the content-based neural attention that is now a core tool in deep-learning-based NLP (language models)](https://archive.is/JxMmF#selection-99.0-103.76). A disadvantage of the fixed-length context vector design is that it cannot remember long sentences. The attention mechanism was born to resolve this problem by helping the model memorize long input sentences in language translation. [[Bahdanau deserves the praise](https://archive.is/3DwY5)]
  - The Annotated Transformer by Harvard NLP Group - Further reading to understand the "Attention is all you need" paper.
  - Visualizing A Neural Machine Translation Model (Mechanics of Seq2seq Models With Attention)
- Additional Reading
  - Language, trees, and geometry in neural networks - a series of expository notes accompanying the paper, "Visualizing and Measuring the Geometry of BERT" by Google's People + AI Research (PAIR) team.
  - Benchmarking Transformers: PyTorch and TensorFlow - a comparison of inference time (on CPU and GPU) and memory usage for a wide range of transformer architectures.
  - Evolution of representations in the Transformer - An accessible article that presents the insights of their EMNLP 2019 paper. They look at how the representations of individual tokens in Transformers trained with different objectives change.
  - The dark secrets of BERT - This post probes fine-tuned BERT models for linguistic knowledge. In particular, the authors analyse how many self-attention patterns with some linguistic interpretation are actually used to solve downstream tasks. TL;DR: They are unable to find evidence that linguistically interpretable self-attention maps are crucial for downstream performance.
  - A Visual Guide to Using BERT for the First Time - Tutorial on using BERT in practice, such as for sentiment analysis on movie reviews by Jay Alammar.
  - Turing-NLG: A 17-billion-parameter language model
  - MUM (Multitask Unified Model): A new AI milestone for understanding information
  - GPT-3 is No Longer the Only Game in Town - GPT-3 was by far the largest AI model of its kind last year (2020). Now? Not so much.
  - OpenAI's API Now Available with No Waitlist - GPT-3 access without the wait. However, apps must be approved before [going live](https://beta.openai.com/docs/going-live). This release also allows them to review applications, monitor for misuse, and better understand the effects of this tech.
  - The Inherent Limitations of GPT-3 - One thing missing from the article if you've read [Gwern's GPT-3 Creative Fiction article](https://www.gwern.net/GPT-3#repetitiondivergence-sampling) before is the mystery known as "Repetition/Divergence Sampling":
  - Building games and apps entirely through natural language using OpenAI's code-davinci model - The author built several small games and apps without touching a single line of code, simply by telling the model what they want.
  - OpenAI rival Cohere launches language model API - Backed by AI experts, it aims to bring Google-quality predictive language to the masses. Aidan Gomez co-wrote a seminal 2017 paper at Google Brain that introduced a concept known as "Transformers".
  - State of AI Report 2022 - Key takeaways:
  - How to Build OpenAI's GPT-2: "The AI That's Too Dangerous to Release"
  - How the Transformers broke NLP leaderboards
  - Real-time Natural Language Understanding with BERT using NVIDIA TensorRT
  - NLP's Clever Hans Moment has Arrived
  - GPT-3 can run code - You provide an input text and a command, and GPT-3 transforms them into the expected output. It works well for tasks like changing coding style, translating between programming languages, refactoring, and adding documentation. For example, it converts JSON into YAML, translates Python code to JavaScript, and improves the runtime complexity of a function.
  - Using GPT-3 to explain how code works
  - Character AI announces they're building a full stack AGI company - Founded by Noam Shazeer (co-invented Transformers, scaled them to supercomputers for the first time, and pioneered large-scale pretraining) and Daniel de Freitas (led the development of LaMDA), whose work is foundational to recent AI progress.
  - Startups competing with OpenAI's GPT-3 all need to solve the same problems - Last year, two startups released their own proprietary text-generation APIs. AI21 Labs launched its 178-billion-parameter Jurassic-1 in Aug 2021, and Cohere released a range of models. Cohere hasn't disclosed how many parameters its models contain. ... There are other up-and-coming startups looking to solve the same issues. One is Anthropic, the AI safety and research company started by a group of ex-OpenAI employees. Several researchers have left Google Brain to join two new ventures started by their colleagues. One outfit is named Character.ai, and the other Persimmon Labs.
  - Cohere Wants to Build the Definitive NLP Platform - Beyond generative models like GPT-3.
  - The Scaling Hypothesis - On GPT-3: meta-learning, scaling, implications, and deep theory.
  - The Next Generation Of Large Language Models - It highlights 3 emerging areas: 1) models that can generate their own training data to improve themselves, 2) models that can fact-check themselves, and 3) massive sparse expert models.
  - GPT-4 analysis and predictions - Somewhat related: in the post ["Bing Chat is blatantly, aggressively misaligned"](https://www.lesswrong.com/posts/jtoPawEhLNXNxvgTT/bing-chat-is-blatantly-aggressively-misaligned), Gwern considers why Bing Chat/Sydney is so different from ChatGPT. His hypothesis: "Sydney is not a RLHF trained GPT-3 model but a GPT-4 model developed in a hurry". Some have also argued that Sydney performs better on reasoning tasks than ChatGPT/GPT-3.5 and that it may be GPT-4.
  - Mosaic LLMs (Part 2): GPT-3 quality for <$500k (2022) - They claim their [Composer PyTorch framework](https://github.com/mosaicml/composer) eases model training. Now that the Colossal-AI framework exists, I wonder how good their solution is. Until their users train it, I guess everything is purely hypothetical.
  - Transformers From Scratch
  - Finetune Llama 3.1 on GCP for production use cases
  - OpenAI’s GPT2 - Food to Media hype or Wake Up Call?
  - I made a transformer by hand (no training!) - Make a transformer to predict a simple sequence manually — not by training one, or using pretrained weights, but instead by **assigning each weight, by hand**, over an evening.
  - Language Modelling at Scale: Gopher, Ethical considerations, and Retrieval - The paper presents an analysis of Transformer-based language model performance across a wide range of model scales, from models with tens of millions of parameters up to a 280 billion parameter model called Gopher.
  - Open AI gets GPT-3 to work by hiring an army of humans to fix GPT’s bad answers
  - How Much Better is OpenAI’s Newest GPT-3 Model? - In addition to ChatGPT, OpenAI released text-davinci-003, a Reinforcement Learning-tuned model that performs better at long-form writing. For example, it can explain code in the style of Eminem. 😀
  - AI And The Limits Of Language — An AI system trained on words and sentences alone will never approximate human understanding - What LLMs like ChatGPT can and cannot do, and why AGI is not here yet.
- Transformer Reinforcement Learning
  - Illustrating Reinforcement Learning from Human Feedback - Recent advances with language models (ChatGPT for example) have been powered by RLHF.
  - Training a Helpful and Harmless Assistant with RLHF (paper) [[tweet](https://twitter.com/anthropicai/status/1514277273070825476)]
  - The Wisdom of Hindsight Makes Language Models Better Instruction Followers (paper) - The underlying RLHF algo is complex and requires an additional training pipeline for reward and value networks. They consider an alternative approach, Hindsight Instruction Relabeling (HIR): converting feedback to instruction by relabeling the original one and training the model for better alignment.
  - CarperAI/TRLX - Originated as a fork of TRL. It allows you to fine-tune Hugging Face language models (GPT2, GPT-NeoX based) up to 20B parameters using Reinforcement Learning. Brought to you by CarperAI (born at EleutherAI, an org part of StabilityAI family). CarperAI is developing production ready open-source RLHF tools. They have [announced plans for the first open-source "instruction-tuned" LM](https://carper.ai/instruct-gpt-announcement/).
  - allenai/RL4LMs - RL for language models (RL4LMs) by Allen AI. It's a modular RL library to fine-tune language models to human preferences.
  - Iterative Reasoning Preference Optimization (IRPO) - Llama-2-70B-Chat improves **from 55.6% to 81.6% on GSM8K** with this method. They apply iterative preference optimization to improve reasoning: generate chain-of-thought candidates with the LLM, construct preference pairs based on whether the final answer is correct, train with DPO plus an NLL term, and repeat. Each iteration samples new candidates from the model trained in the previous iteration, so the preference pairs track the model's current reasoning errors.
  - From r to Q∗: Your Language Model is Secretly a Q-Function - The paper bridges the gap between two approaches to RLHF - the standard RLHF setup and Direct Preference Optimization (DPO) - by deriving DPO as a general inverse Q-learning algorithm in a token-level MDP (Markov Decision Process). The authors provide empirical insights into the benefits of DPO, including its ability to perform credit assignment, and demonstrate improvements over the base DPO policy using simple beam search, with **potential applications in multi-turn dialogue, reasoning, and agentic systems**.
  - lvwerra/TRL - Train transformer language models with reinforcement learning.
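
The pair-construction and training-loss steps described in the Iterative Reasoning Preference Optimization entry above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names, the `beta` and `alpha` defaults, and the per-token normalization of the NLL term are assumptions, and the log-probabilities are plain numbers standing in for real model outputs.

```python
import math
from itertools import product


def build_preference_pairs(candidates, gold_answer):
    """Pair every correct candidate (chosen) with every incorrect one (rejected).

    `candidates` is a list of (reasoning_text, final_answer) tuples sampled
    from the current model for a single prompt (hypothetical data layout).
    """
    correct = [c for c in candidates if c[1] == gold_answer]
    wrong = [c for c in candidates if c[1] != gold_answer]
    return list(product(correct, wrong))


def dpo_nll_loss(logp_chosen, logp_rejected,
                 ref_logp_chosen, ref_logp_rejected,
                 chosen_len, beta=0.1, alpha=1.0):
    """DPO loss for one pair plus an NLL term on the chosen sequence.

    All log-probabilities are sequence-level sums. `beta` and `alpha`
    are illustrative defaults, not values taken from the paper.
    """
    # How much more the policy prefers "chosen" over "rejected"
    # than the frozen reference model does.
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    dpo = math.log1p(math.exp(-margin))   # equals -log(sigmoid(margin))
    nll = -logp_chosen / chosen_len       # assumed per-token normalization
    return dpo + alpha * nll


# Toy usage: three sampled answers to a prompt whose gold answer is "4".
samples = [("2 + 2 = 4", "4"), ("2 + 2 = 5", "5"), ("two pairs make 4", "4")]
pairs = build_preference_pairs(samples, gold_answer="4")
```

In a real pipeline the log-probabilities would come from the current policy and a frozen reference model, and candidates would be resampled after every training round.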
Papers
- SDNet: Contextualized Attention-based Deep Network for Conversational Question Answering
- The Llama 3 Herd of Models - The paper, an oft-overlooked component of the project, proved to be just as vital as the model itself, if not more so, and its significance came as a complete surprise. A masterpiece in its own right, the paper presents a treasure trove of detailed information on the model's pre-training and post-training processes, offering insights that are both profound and practical. [[Discussion](https://old.reddit.com//r/LocalLLaMA/comments/1eabf4l)]
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding - Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova.
- Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context
- Conditional BERT Contextual Augmentation
- Language Models are Unsupervised Multitask Learners
- The Evolved Transformer
- XLNet: Generalized Autoregressive Pretraining for Language Understanding
- Thomas Wolf
- Comments from HN
- CTRL: Conditional Transformer Language Model for Controllable Generation
- Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
- Reformer: The Efficient Transformer
- Supervised Multimodal Bitransformers for Classifying Images and Text
- A Primer in BERTology: What we know about how BERT works
- Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
- An Attention Free Transformer
- A Survey of Transformers
- Evaluating Large Language Models Trained on Code
- Training language models to follow instructions with human feedback - The InstructGPT paper. [ChatGPT](https://openai.com/blog/chatgpt/) is a sibling model to InstructGPT.
- LaMDA: Language Models for Dialog Applications
- Scaling Instruction-Finetuned Language Models - They find that instruction finetuning with the above aspects dramatically improves performance on a variety of model classes (PaLM, T5, U-PaLM), prompting setups (zero-shot, few-shot, CoT), and evaluation benchmarks. Flan-PaLM 540B achieves SoTA performance on several benchmarks. They also publicly release [Flan-T5 checkpoints](https://github.com/google-research/t5x/blob/main/docs/models.md#flan-t5-checkpoints), which achieve strong few-shot performance even compared to much larger models, such as PaLM 62B.
- Emergent Abilities of Large Language Models
- Nonparametric Masked (NPM) Language Modeling - Nonparametric models with **500x fewer parameters outperform GPT-3 on zero-shot tasks.**
- Transformer models: an introduction and catalog - The goal of this paper is to offer a somewhat comprehensive but simple catalog and classification of the most popular Transformer models. The paper also includes an introduction to the most important aspects and innovation in Transformer models.
- Foundation Models for Decision Making: Problems, Methods, and Opportunities - A report on recent approaches (e.g., conditional generative modeling, RL, prompting) that ground pre-trained models (e.g., LMs) in practical decision-making agents. Models can serve as models of world dynamics or steer decisions.
- GPT-4 Technical Report
- PLMpapers - BERT (Transformer, transfer learning) has catalyzed research in pretrained language models (PLMs) and has sparked many extensions. This repo contains a list of papers on PLMs.
- tomohideshibata/BERT-related papers
- Training Compute-Optimal Large Language Models - DeepMind has found the secret to cheaply scaling large language models: to be compute-optimal, model size and training data must be scaled equally. It shows that most LLMs are severely starved of data and under-trained. Given the [new scaling law](https://www.alignmentforum.org/posts/6Fpvch8RR29qLEWNH/chinchilla-s-wild-implications), even if you pump a quadrillion parameters into a model (GPT-4 urban myth), the gains will not compensate for 4x more training tokens.
- Improving language models by retrieving from trillions of tokens - The group explores an alternative path for efficient training with Internet-scale retrieval. The method is known as RETRO, for "Retrieval Enhanced TRansfOrmers". With RETRO **the model is not limited to the data seen during training – it has access to the entire training dataset through the retrieval mechanism. This results in significant performance gains compared to a standard Transformer with the same number of parameters**. RETRO obtains comparable performance to GPT-3 on the Pile dataset, despite using 25 times fewer parameters. They show that language modeling improves continuously as they increase the size of the retrieval database. [[blog post](https://www.deepmind.com/blog/improving-language-models-by-retrieving-from-trillions-of-tokens)]
- Thomas Wolf
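
The "scaled equally" claim in the Training Compute-Optimal Large Language Models entry above can be made concrete with a back-of-the-envelope sketch. It assumes the common approximation of about 6 FLOPs per parameter per training token and a rule of thumb of roughly 20 training tokens per parameter. Both constants are assumptions for illustration, not figures taken from this list, and `compute_optimal_split` is a hypothetical helper name.

```python
import math

# Assumptions (not from this list): training FLOPs C ~= 6 * N * D,
# and a compute-optimal ratio of roughly 20 tokens per parameter.
FLOPS_PER_PARAM_TOKEN = 6.0
TOKENS_PER_PARAM = 20.0


def compute_optimal_split(flops_budget: float) -> tuple[float, float]:
    """Return (params, tokens) that spend flops_budget under the assumptions above."""
    # C = 6 * N * D with D = r * N  =>  N = sqrt(C / (6 * r))
    params = math.sqrt(flops_budget / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    tokens = TOKENS_PER_PARAM * params
    return params, tokens


if __name__ == "__main__":
    for budget in (1e21, 4e21):
        n, d = compute_optimal_split(budget)
        print(f"C={budget:.0e}: params~{n:.2e}, tokens~{d:.2e}")
```

Under these assumptions, quadrupling the compute budget doubles both the parameter count and the token count, which is one way to read "model size and training data must be scaled equally".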
Educational
- Additional Reading
  - The GPT-3 Architecture, on a Napkin
  - PicoGPT: GPT in 60 Lines of NumPy
  - Video explainer about the core of transformer architecture - Read The Illustrated Transformer, but still didn't feel like you had an intuitive understanding of what the various pieces of attention were doing? In this video, a more constructive approach to explaining the transformer and attention can help you understand it better: starting from a simple convolutional neural network (CNN), the author will step you through all of the changes that need to be made to a CNN to become a transformer.
  - A Hackers' Guide to Language Models (video) - A quick run through all the basic ideas of language models, how to use them (both open models and OpenAI-based models) using code as much as possible.
  - A visual intro to large language models (LLMs) by Jay Alammar/Cohere - A high-level look at LLMs and some of their applications for language processing. It covers text generation models (like GPT) and representation models (like BERT).
  - Interfaces for Explaining Transformer Language Models - A gentle visual guide to Transformer models that looks at input saliency and neuron activation inside neural networks. **Our understanding of why these models work so well, however, still lags behind these developments**.
  - minGPT - A PyTorch re-implementation of GPT, both training and inference. minGPT tries to be small, clean, interpretable and educational, as most of the currently available GPT model implementations can be a bit sprawling. GPT is not a complicated model and this implementation is appropriately about 300 lines of code.
  - nanoGPT - A rewrite of minGPT, still under active development. The associated and ongoing video lecture series _[Neural Networks: Zero to Hero](https://karpathy.ai/zero-to-hero.html)_ builds GPT from scratch, in code, and aspires to spell everything out. Note that Karpathy's bottom-up approach and fast.ai's teaching style work well together. (FYI, fast.ai has both top-down ("part 1") and bottom-up ("part 2") approaches.)
- Tutorials
  - How to train a new language model from scratch using Transformers and Tokenizers
AI Safety
- Tutorials
  - Transformer Circuits Thread - Can we reverse engineer transformer language models into human-understandable computer programs? Interpretability research benefits a lot from interactive articles. As part of their effort, they've created several other resources besides their paper like "A Mathematical Framework for Transformer Circuits" and ["toy models of superposition"](https://threadreaderapp.com/thread/1570087876053942272.html).
  - Discovering Language Model Behaviors with Model-Written Evaluations (paper) - They automatically generate evaluations with LMs. They discover new cases of inverse scaling where LMs get worse with size. They also find some of the first examples of inverse scaling in RLHF, where more RLHF makes LMs worse.
  - Cognitive Biases in Large Language Models
  - Yann LeCun's unwavering opinion on current (auto-regressive) LLMs (Tweet)
  - Core Views on AI Safety: When, Why, What, and How
  - Personalisation within bounds: A risk taxonomy and policy framework for the alignment of large language models with personalised feedback
  - GPTs are GPTs: An Early Look at the Labor Market Impact Potential of Large Language Models (paper) - The paper argues GPT is a General Purpose Technology.
  - Transformers learn in-context by gradient descent (paper)
  - Why Can GPT Learn In-Context? Language Models Secretly Perform Gradient Descent as Meta-Optimizers (paper)
  - Tracr: Compiled Transformers as a Laboratory for Interpretability (paper) - TRACR (TRAnsformer Compiler for RASP) is a compiler for converting RASP programs (DSL for Transformers) into weights of a GPT-like model. Usually, we train Transformers to encode algorithms in their weights. With TRACR, we go in the reverse direction; compile weights **directly** from explicit code. Why do this? Accelerate interpretability research. Think of it like formal methods (from software eng.) on Transformers. It can be difficult to check if the explanation an interpretability tool provides is correct. [[Tweet](https://twitter.com/davlindner/status/1613900577804525573), [code](https://github.com/deepmind/tracr)]
Videos
- [BERTology](https://huggingface.co/transformers/bertology.html)
- Attention and Transformer Networks
  - Sequence to Sequence Learning Animated (Inside Transformer Neural Networks and Attention Mechanisms)
- General
  - Trials and tribulations of OPT-175B training by Susan Zhang at Meta - In this talk, they walk through the development lifecycle of OPT-175B, covering infrastructure and training convergence challenges faced at scale, along with methods of addressing these issues going forward. Amazing that they managed to pull off such a feat. Key takeaways: data matters a lot; it takes a very deep understanding of neural network nuts and bolts (LR, SGD, etc.) and engineering; even more time than usual is spent staring at the loss curves; and it helps to understand, through Chinchilla's scaling law, how new architectures and algorithms behave as you scale up. [[LLM training log](https://github.com/facebookresearch/metaseq/blob/main/projects/OPT/chronicles/OPT175B_Logbook.pdf)]
Transformer Implementations By Communities
- PyTorch
  - facebook/fairseq - RoBERTa: A Robustly Optimized BERT Pretraining Approach by Facebook AI Research. SoTA results on GLUE, SQuAD and RACE.
  - codertimo/BERT-pytorch - Google AI 2018 BERT pytorch implementation.
  - innodatalabs/tbert - PyTorch port of BERT ML model.
  - dreamgonfly/BERT-pytorch - A PyTorch implementation of BERT in "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding".
  - dhlee347/pytorchic-bert - A Pytorch implementation of Google BERT.
  - NVIDIA/Megatron-LM - Ongoing research training transformer language models at scale, including: BERT.
  - deepset-ai/FARM - Simple & flexible transfer learning for the industry.
  - NVIDIA/NeMo - Neural Modules is a toolkit for conversational AI by NVIDIA. They are trying to [improve speech recognition with BERT post-processing](https://nvidia.github.io/NeMo/nlp/intro.html#improving-speech-recognition-with-bertx2-post-processing-model).
  - facebook/MMBT - Multimodal transformers model that can accept a transformer model and a computer vision model for classifying image and text.
  - dbiir/UER-py - Open Source Pre-training Model Framework in PyTorch & Pre-trained Model Zoo (with more focus on Chinese).
  - lucidrains/x-transformers - A simple but complete full-attention transformer with a set of promising experimental features from various papers (good for learning purposes). There is a 2021 paper rounding up Transformer modifications, [_Do Transformer Modifications Transfer Across Implementations and Applications?_](https://arxiv.org/abs/2102.11972).
  - kaushaltrivedi/fast-bert - Super easy library for BERT based NLP models. Built based on 🤗 Transformers and is inspired by fast.ai.
- PyTorch and TensorFlow
  - 🤗 Hugging Face Transformers - Formerly known as [pytorch-transformers](https://github.com/huggingface/pytorch-transformers) and [pytorch-pretrained-bert](https://github.com/huggingface/pytorch-pretrained-BERT), it provides state-of-the-art general-purpose architectures (BERT, GPT-2, RoBERTa, XLM, DistilBert, XLNet, CTRL...) for Natural Language Understanding (NLU) and Natural Language Generation (NLG) with 32+ pretrained models in 100+ languages and deep interoperability between TensorFlow 2.0 and PyTorch. [[Paper](https://arxiv.org/abs/1910.03771)]
  - spacy-transformers - A library that wraps Hugging Face's Transformers in order to extract features to power NLP pipelines. It also calculates an alignment so the Transformer features can be related back to actual words instead of just wordpieces.
  - FasterTransformer - Transformer related optimization, including BERT and GPT. This repo provides a script and recipe to run the highly optimized transformer-based encoder and decoder component, and it is tested and maintained by NVIDIA.
- Keras
  - Separius/BERT-keras - Keras implementation of BERT with pre-trained weights.
  - CyberZHG/keras-bert - Implementation of BERT that could load official pre-trained models for feature extraction and prediction.
  - bojone/bert4keras - Light reimplementation of BERT for Keras.
- TensorFlow
  - kimiyoung/transformer-xl - Code repository associated with the Transformer-XL paper.
  - zihangdai/xlnet - Code repository associated with the XLNet paper.
- Chainer
  - soskek/bert-chainer - Chainer implementation of "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding".
- Other
  - Cformers - SoTA Transformers with C-backend for fast inference on your CPU.
  - Alpaca.cpp - Run a fast ChatGPT-like model locally on your device.
  - LLaMA compatible port
  - Apple Neural Engine (ANE) Transformers - Transformer architecture optimized for Apple Silicon.
  - llama.cpp - Port of Facebook's LLaMA model in C/C++.
  - Transformers.js - Run 🤗 Transformers in your browser.
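
The spacy-transformers entry above mentions an alignment that relates Transformer features back to whole words rather than wordpieces. A minimal sketch of that idea follows, assuming BERT-style WordPiece tokens where a `##` prefix marks a continuation piece. `align_wordpieces` is a hypothetical helper for illustration and is not the library's actual alignment code.

```python
def align_wordpieces(wordpieces: list[str]) -> list[list[int]]:
    """Group wordpiece indices by the word they belong to.

    Assumes the BERT WordPiece convention: a piece starting with '##'
    continues the previous word; any other piece starts a new word.
    """
    alignment: list[list[int]] = []
    for index, piece in enumerate(wordpieces):
        if piece.startswith("##") and alignment:
            # Continuation piece: attach it to the current word.
            alignment[-1].append(index)
        else:
            # A new word begins here.
            alignment.append([index])
    return alignment


if __name__ == "__main__":
    pieces = ["play", "##ing", "chess"]
    print(align_wordpieces(pieces))  # [[0, 1], [2]]
```

With such a mapping, per-wordpiece feature vectors can be pooled (for example, averaged) into one vector per word.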
Transfer Learning in NLP
- Other
  - NLP's ImageNet moment
  - Semi-supervised Sequence Learning - See also the OpenAI Transformer (by OpenAI researchers [Radford](https://twitter.com/alecrad), [Narasimhan](https://twitter.com/karthik_r_n), [Salimans](https://twitter.com/timsalimans), and [Sutskever](https://twitter.com/ilyasut)), and the Transformer ([Vaswani et al](https://arxiv.org/abs/1706.03762)).
  - MultiFiT: Efficient Multi-lingual Language Model Fine-tuning
Books
- Other
  - Transfer Learning for Natural Language Processing - A book that is a practical primer to transfer learning techniques capable of delivering huge improvements to your NLP models.
  - Natural Language Processing with Transformers - This practical book shows you how to train and scale these large models using Hugging Face Transformers. The authors use a hands-on approach to teach you how transformers work and how to integrate them in your applications.
Tasks
- Text Generation
  - Plug and Play Language Models: a Simple Approach to Controlled Text Generation
  - asyml/texar - Toolkit for Text Generation and Beyond. [Texar](https://texar.io) is a general-purpose text generation toolkit that also implements BERT for classification and, combined with Texar's other modules, for text generation applications.
- Named-Entity Recognition (NER)
  - kyzhouhzau/BERT-NER - Use Google BERT to do CoNLL-2003 NER.
  - zhpmatrix/bert-sequence-tagging - Chinese sequence labeling.
  - mhcao916/NER_Based_on_BERT - A Chinese NER project based on the Google BERT model.
  - ProHiryu/bert-chinese-ner - Use the pre-trained language model BERT to do Chinese NER.
  - FuYanzhe2/Name-Entity-Recognition - LSTM-CRF, Lattice-CRF, recent NER related papers.
  - macanv/BERT-BiLSMT-CRF-NER - TensorFlow solution of NER task using Bi-LSTM-CRF model with Google BERT fine-tuning.
- Classification
  - brightmart/sentiment_analysis_fine_grain - Multi-label classification with BERT; Fine Grained Sentiment Analysis from AI challenger.
  - zhpmatrix/Kaggle-Quora-Insincere-Questions-Classification - Kaggle baseline—fine-tuning BERT and tensor2tensor based Transformer encoder solution.
  - maksna/bert-fine-tuning-for-chinese-multiclass-classification - Fine-tune Google's pre-trained BERT model for Chinese multiclass classification.
  - NLPScott/bert-Chinese-classification-task - BERT Chinese classification practice.
  - fooSynaptic/BERT_classifer_trial - BERT trial for Chinese corpus classification.
  - Socialbird-AILab/BERT-Classification-Tutorial - Tutorial.
  - malteos/pytorch-bert-document-classification - Enriching BERT with Knowledge Graph Embedding for Document Classification (PyTorch)
- Question Answering (QA)
  - matthew-z/R-net - R-net in PyTorch, with BERT and ELMo.
  - benywon/ChineseBert - This is a Chinese BERT model specific for question answering.
  - facebookresearch/SpanBERT - Question Answering on SQuAD; improving pre-training by representing and predicting spans.
- Knowledge Graph
  - sakuranew/BERT-AttributeExtraction - Using BERT for attribute extraction in knowledge graph. Fine-tuning and feature extraction. The BERT-based fine-tuning and feature extraction methods are used to extract knowledge attributes of Baidu Encyclopedia characters.
  - lvjianxin/Knowledge-extraction - Chinese knowledge-based extraction. Baseline: bi-LSTM+CRF; upgrade: BERT pre-training.
Official BERT Implementations
- General
  - google-research/bert - TensorFlow code and pre-trained models for BERT.
Other Resources
- Other
  - brightmart/bert_language_understanding - Pre-training of Deep Bidirectional Transformers for Language Understanding: pre-train TextCNN.
  - HighCWu/keras-bert-tpu - Implementation of BERT that could load official pre-trained models for feature extraction and prediction on TPU.
  - whqwill/seq2seq-keyphrase-bert - Add BERT to encoder part for https://github.com/memray/seq2seq-keyphrase-pytorch
  - Y1ran/NLP-BERT--Chinese version
  - Willyoung2017/Bert_Attempt
  - Pydataman/bert_examples - Some examples of BERT. `run_classifier.py` is based on Google BERT for the Kaggle Quora Insincere Questions Classification challenge. `run_ner.py` is a BERT-based NER for the first season of the Ruijin Hospital AI contest.
  - Microsoft/AzureML-BERT - End-to-end walk through for fine-tuning BERT using Azure Machine Learning.
  - yoheikikuta/bert-japanese - BERT with SentencePiece for Japanese text.
  - turtlesoupy/this-word-does-not-exist - "This Word Does Not Exist" is a project that allows people to train a variant of GPT-2 that makes up words, definitions and examples from scratch. We've never seen fake text so real.
  - hanxiao/bert-as-service - Mapping a variable-length sentence to a fixed-length vector using pretrained BERT model.
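
The bert-as-service entry above describes mapping a variable-length sentence to a fixed-length vector. One common way to do that is to average the per-token vectors (mean pooling). The sketch below shows that idea in plain Python. It is an illustration under that assumption, not the project's actual code, and `mean_pool` is a hypothetical name.

```python
def mean_pool(token_vectors: list[list[float]]) -> list[float]:
    """Average per-token vectors into one fixed-length sentence vector."""
    if not token_vectors:
        raise ValueError("need at least one token vector")
    dim = len(token_vectors[0])
    if any(len(vec) != dim for vec in token_vectors):
        raise ValueError("all token vectors must have the same length")
    count = len(token_vectors)
    # Average each dimension across all tokens.
    return [sum(vec[i] for vec in token_vectors) / count for i in range(dim)]


if __name__ == "__main__":
    short_sentence = [[1.0, 2.0], [3.0, 4.0]]           # 2 tokens
    long_sentence = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]  # 3 tokens
    # Both results have length 2, regardless of sentence length.
    print(mean_pool(short_sentence), mean_pool(long_sentence))
```

The output length depends only on the hidden size of the token vectors, so sentences of different lengths become directly comparable.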
Tools
- Other
  - jessevig/bertviz - Tool for visualizing attention in the Transformer model.
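
Tools like the one above display attention weights. As a reminder of what those numbers are, here is a minimal sketch of scaled dot-product attention weights (softmax of query-key dot products divided by the square root of the key dimension) in plain Python. The function name and toy vectors are hypothetical, and this is not bertviz's API.

```python
import math


def attention_weights(queries: list[list[float]], keys: list[list[float]]) -> list[list[float]]:
    """Return one row of softmax-normalized weights per query."""
    dim = len(keys[0])
    rows = []
    for query in queries:
        # Scaled dot product between this query and every key.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in keys
        ]
        # Numerically stable softmax over the scores.
        peak = max(scores)
        exps = [math.exp(score - peak) for score in scores]
        total = sum(exps)
        rows.append([value / total for value in exps])
    return rows


if __name__ == "__main__":
    q = [[1.0, 0.0]]
    k = [[1.0, 0.0], [0.0, 1.0]]
    for row in attention_weights(q, k):
        print([round(w, 3) for w in row])
```

Each row sums to 1, and the key that best matches the query receives the largest weight; a heatmap of these rows is what attention visualizations typically draw.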
