Transformer: PyTorch Implementation of "Attention Is All You Need"
- Host: GitHub
- URL: https://github.com/hyunwoongko/transformer
- Owner: hyunwoongko
- Created: 2019-10-15T10:36:00.000Z (over 5 years ago)
- Default Branch: master
- Last Pushed: 2024-08-06T14:40:08.000Z (6 months ago)
- Last Synced: 2025-01-09T03:25:01.200Z (13 days ago)
- Topics: attention, dataset, pytorch, transformer
- Language: Python
- Homepage:
- Size: 1.95 MB
- Stars: 3,225
- Watchers: 10
- Forks: 456
- Open Issues: 12
### WARNING
This code was written in 2019, when I was not very familiar with the transformer model.
So don't trust this code too much. I am not currently maintaining it well, so please open a pull request if you find a bug and want to fix it.

# Transformer
My own implementation of the Transformer model (Attention is All You Need - Google Brain, 2017)

![model](image/model.png)

## 1. Implementations
### 1.1 Positional Encoding
![model](image/positional_encoding.jpg)
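As a quick sanity check of the sinusoid formula in the figure, here is a minimal standalone sketch. The function name `sinusoid_table` is hypothetical (it is not part of this repository), and the sketch assumes an even `d_model`.

```python
import torch


def sinusoid_table(max_len: int, d_model: int) -> torch.Tensor:
    """Return a [max_len, d_model] table of sinusoid position encodings.

    Hypothetical helper for illustration; assumes d_model is even.
    """
    pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)  # [max_len, 1]
    two_i = torch.arange(0, d_model, 2, dtype=torch.float32)       # [d_model / 2]
    angle = pos / (10000 ** (two_i / d_model))                     # [max_len, d_model / 2]

    table = torch.zeros(max_len, d_model)
    table[:, 0::2] = torch.sin(angle)  # even columns use sine
    table[:, 1::2] = torch.cos(angle)  # odd columns use cosine
    return table


table = sinusoid_table(max_len=4, d_model=6)
print(table.shape)  # torch.Size([4, 6])
print(table[0])     # position 0: sine columns are 0, cosine columns are 1
```

Each row depends only on the position index, so the same table can be sliced to any sequence length up to `max_len`.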
```python
import torch
from torch import nn


class PositionalEncoding(nn.Module):
    """
    compute sinusoid encoding.
    """

    def __init__(self, d_model, max_len, device):
        """
        constructor of sinusoid encoding class

        :param d_model: dimension of model
        :param max_len: max sequence length
        :param device: hardware device setting
        """
        super(PositionalEncoding, self).__init__()

        # same size as the input matrix (so it can be added to the input matrix)
        self.encoding = torch.zeros(max_len, d_model, device=device)
        self.encoding.requires_grad = False  # we don't need to compute gradient

        pos = torch.arange(0, max_len, device=device)
        pos = pos.float().unsqueeze(dim=1)
        # 1D => 2D unsqueeze to represent word's position

        _2i = torch.arange(0, d_model, step=2, device=device).float()
        # 'i' means index of d_model (e.g. embedding size = 50, 'i' = [0,50])
        # "step=2" means 'i' multiplied by two (same as 2 * i)

        self.encoding[:, 0::2] = torch.sin(pos / (10000 ** (_2i / d_model)))
        self.encoding[:, 1::2] = torch.cos(pos / (10000 ** (_2i / d_model)))
        # compute positional encoding to consider positional information of words

    def forward(self, x):
        # self.encoding
        # [max_len = 512, d_model = 512]

        batch_size, seq_len = x.size()
        # [batch_size = 128, seq_len = 30]

        return self.encoding[:seq_len, :]
        # [seq_len = 30, d_model = 512]
        # it will be added to tok_emb : [128, 30, 512]
```

### 1.2 Multi-Head Attention
![model](image/multi_head_attention.jpg)
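The head split and merge in this section are pure reshapes, which can be checked in isolation. The following is a minimal sketch with small, arbitrarily chosen sizes (the variable names are illustrative, not taken from the repository).

```python
import torch

batch_size, length, d_model, n_head = 2, 5, 8, 4
d_tensor = d_model // n_head  # width of each head

x = torch.randn(batch_size, length, d_model)

# split: [batch, length, d_model] -> [batch, head, length, d_tensor]
heads = x.view(batch_size, length, n_head, d_tensor).transpose(1, 2)

# concat: [batch, head, length, d_tensor] -> [batch, length, d_model]
merged = heads.transpose(1, 2).contiguous().view(batch_size, length, d_model)

print(heads.shape)             # torch.Size([2, 4, 5, 2])
print(torch.equal(merged, x))  # True: concat undoes split exactly
```

The `.contiguous()` call is needed because `transpose` returns a non-contiguous view, and `view` requires a compatible memory layout.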
```python
class MultiHeadAttention(nn.Module):

    def __init__(self, d_model, n_head):
        super(MultiHeadAttention, self).__init__()
        self.n_head = n_head
        self.attention = ScaleDotProductAttention()
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_concat = nn.Linear(d_model, d_model)

    def forward(self, q, k, v, mask=None):
        # 1. dot product with weight matrices
        q, k, v = self.w_q(q), self.w_k(k), self.w_v(v)

        # 2. split tensor by number of heads
        q, k, v = self.split(q), self.split(k), self.split(v)

        # 3. do scale dot product to compute similarity
        out, attention = self.attention(q, k, v, mask=mask)

        # 4. concat and pass to linear layer
        out = self.concat(out)
        out = self.w_concat(out)

        # 5. visualize attention map
        # TODO : we should implement visualization

        return out

    def split(self, tensor):
        """
        split tensor by number of head

        :param tensor: [batch_size, length, d_model]
        :return: [batch_size, head, length, d_tensor]
        """
        batch_size, length, d_model = tensor.size()

        d_tensor = d_model // self.n_head
        tensor = tensor.view(batch_size, length, self.n_head, d_tensor).transpose(1, 2)
        # it is similar to group convolution (split by number of heads)

        return tensor

    def concat(self, tensor):
        """
        inverse function of self.split(tensor : torch.Tensor)

        :param tensor: [batch_size, head, length, d_tensor]
        :return: [batch_size, length, d_model]
        """
        batch_size, head, length, d_tensor = tensor.size()
        d_model = head * d_tensor

        tensor = tensor.transpose(1, 2).contiguous().view(batch_size, length, d_model)
        return tensor
```

### 1.3 Scale Dot Product Attention
![model](image/scale_dot_product_attention.jpg)
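To see what the mask does to the attention weights, here is a minimal functional sketch of the same computation. The function name `scaled_dot_product` and the toy mask are illustrative assumptions, not part of the repository.

```python
import math

import torch


def scaled_dot_product(q, k, v, mask=None):
    """softmax(q @ k^T / sqrt(d)) @ v over the last two dimensions."""
    d = q.size(-1)
    score = (q @ k.transpose(-2, -1)) / math.sqrt(d)
    if mask is not None:
        # positions where mask == 0 get a large negative score,
        # so softmax gives them (almost) zero weight
        score = score.masked_fill(mask == 0, -10000)
    weight = torch.softmax(score, dim=-1)
    return weight @ v, weight


q = k = v = torch.randn(1, 2, 3, 4)               # [batch, head, length, d_tensor]
mask = torch.tensor([1, 1, 0]).view(1, 1, 1, 3)   # hide the last key position
out, weight = scaled_dot_product(q, k, v, mask)

print(out.shape)        # torch.Size([1, 2, 3, 4])
print(weight[0, 0])     # last column is (numerically) zero, rows sum to 1
```

The mask broadcasts over the head and query dimensions, so one `[batch, 1, 1, length]` mask covers every head.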
```python
import math


class ScaleDotProductAttention(nn.Module):
    """
    compute scale dot product attention

    Query : given sentence that we focused on (decoder)
    Key : every sentence to check relationship with Query (encoder)
    Value : every sentence same with Key (encoder)
    """

    def __init__(self):
        super(ScaleDotProductAttention, self).__init__()
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, q, k, v, mask=None, e=1e-12):
        # input is 4 dimension tensor
        # [batch_size, head, length, d_tensor]
        batch_size, head, length, d_tensor = k.size()

        # 1. dot product Query with Key^T to compute similarity
        k_t = k.transpose(2, 3)  # transpose
        score = (q @ k_t) / math.sqrt(d_tensor)  # scaled dot product

        # 2. apply masking (opt)
        if mask is not None:
            score = score.masked_fill(mask == 0, -10000)

        # 3. pass them through softmax to make [0, 1] range
        score = self.softmax(score)

        # 4. multiply with Value
        v = score @ v

        return v, score
```

### 1.4 Layer Norm
![model](image/layer_norm.jpg)
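The normalization step can be cross-checked against PyTorch's built-in `torch.nn.functional.layer_norm`. This is a small sketch with arbitrary tensor sizes; it leaves out the learnable `gamma` and `beta`, which start as ones and zeros and so do not change the result at initialization.

```python
import torch
import torch.nn.functional as F

x = torch.randn(2, 3, 8)
eps = 1e-12

mean = x.mean(-1, keepdim=True)
var = x.var(-1, unbiased=False, keepdim=True)  # biased variance over the last dimension
manual = (x - mean) / torch.sqrt(var + eps)

# built-in reference without affine weight and bias
reference = F.layer_norm(x, normalized_shape=(8,), eps=eps)

print(torch.allclose(manual, reference, atol=1e-5))
```

Using `unbiased=False` matters here: the default unbiased estimator divides by `n - 1` and would not match the built-in result.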
```python
class LayerNorm(nn.Module):
    def __init__(self, d_model, eps=1e-12):
        super(LayerNorm, self).__init__()
        self.gamma = nn.Parameter(torch.ones(d_model))
        self.beta = nn.Parameter(torch.zeros(d_model))
        self.eps = eps

    def forward(self, x):
        mean = x.mean(-1, keepdim=True)
        var = x.var(-1, unbiased=False, keepdim=True)
        # '-1' means last dimension.

        out = (x - mean) / torch.sqrt(var + self.eps)
        out = self.gamma * out + self.beta
        return out
```
### 1.5 Positionwise Feed Forward
![model](image/positionwise_feed_forward.jpg)
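"Position-wise" means the same two-layer network is applied to every position independently. A minimal sketch of that property follows, using `nn.Sequential` as a stand-in and omitting dropout so the comparison is deterministic (both are simplifications for illustration).

```python
import torch
from torch import nn

d_model, hidden = 8, 32
ffn = nn.Sequential(
    nn.Linear(d_model, hidden),
    nn.ReLU(),
    nn.Linear(hidden, d_model),
)

x = torch.randn(2, 5, d_model)  # [batch, length, d_model]
full = ffn(x)                   # all positions at once
single = ffn(x[:, 3, :])        # only position 3

# the output at position 3 does not depend on the other positions
print(torch.allclose(full[:, 3, :], single, atol=1e-5))
```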
```python
class PositionwiseFeedForward(nn.Module):
    def __init__(self, d_model, hidden, drop_prob=0.1):
        super(PositionwiseFeedForward, self).__init__()
        self.linear1 = nn.Linear(d_model, hidden)
        self.linear2 = nn.Linear(hidden, d_model)
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(p=drop_prob)

    def forward(self, x):
        x = self.linear1(x)
        x = self.relu(x)
        x = self.dropout(x)
        x = self.linear2(x)
        return x
```

### 1.6 Encoder & Decoder Structure
![model](image/enc_dec.jpg)
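The layers in this section take `src_mask` and `trg_mask` arguments, but this README does not show how they are built. One common construction is sketched below under assumptions: the helper names `make_pad_mask` and `make_causal_mask`, the padding id `pad_idx = 1`, and the mask shapes are all hypothetical. The shapes are chosen to broadcast against attention scores of shape `[batch_size, head, length, length]`, with `False` (equal to `0`) meaning "masked", which matches the `mask == 0` check in section 1.3.

```python
import torch


def make_pad_mask(tokens: torch.Tensor, pad_idx: int) -> torch.Tensor:
    """[batch, length] token ids -> [batch, 1, 1, length] mask, False at padding."""
    return (tokens != pad_idx).unsqueeze(1).unsqueeze(2)


def make_causal_mask(length: int) -> torch.Tensor:
    """[length, length] lower-triangular mask: position i sees positions <= i."""
    return torch.tril(torch.ones(length, length)).bool()


pad_idx = 1  # assumed padding id
src = torch.tensor([[5, 6, 7, pad_idx]])
trg = torch.tensor([[2, 8, 9, pad_idx]])

src_mask = make_pad_mask(src, pad_idx)                                   # [1, 1, 1, 4]
trg_mask = make_pad_mask(trg, pad_idx) & make_causal_mask(trg.size(1))   # [1, 1, 4, 4]

print(src_mask.shape, trg_mask.shape)
print(trg_mask[0, 0].int())
```

The target mask combines both constraints: a decoder position may not attend to later positions, and no position may attend to padding.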
```python
class EncoderLayer(nn.Module):

    def __init__(self, d_model, ffn_hidden, n_head, drop_prob):
        super(EncoderLayer, self).__init__()
        self.attention = MultiHeadAttention(d_model=d_model, n_head=n_head)
        self.norm1 = LayerNorm(d_model=d_model)
        self.dropout1 = nn.Dropout(p=drop_prob)

        self.ffn = PositionwiseFeedForward(d_model=d_model, hidden=ffn_hidden, drop_prob=drop_prob)
        self.norm2 = LayerNorm(d_model=d_model)
        self.dropout2 = nn.Dropout(p=drop_prob)

    def forward(self, x, src_mask):
        # 1. compute self attention
        _x = x
        x = self.attention(q=x, k=x, v=x, mask=src_mask)

        # 2. add and norm
        x = self.dropout1(x)
        x = self.norm1(x + _x)

        # 3. positionwise feed forward network
        _x = x
        x = self.ffn(x)

        # 4. add and norm
        x = self.dropout2(x)
        x = self.norm2(x + _x)
        return x
```
```python
class Encoder(nn.Module):

    def __init__(self, enc_voc_size, max_len, d_model, ffn_hidden, n_head, n_layers, drop_prob, device):
        super().__init__()
        self.emb = TransformerEmbedding(d_model=d_model,
                                        max_len=max_len,
                                        vocab_size=enc_voc_size,
                                        drop_prob=drop_prob,
                                        device=device)

        self.layers = nn.ModuleList([EncoderLayer(d_model=d_model,
                                                  ffn_hidden=ffn_hidden,
                                                  n_head=n_head,
                                                  drop_prob=drop_prob)
                                     for _ in range(n_layers)])

    def forward(self, x, src_mask):
        x = self.emb(x)

        for layer in self.layers:
            x = layer(x, src_mask)

        return x
```
```python
class DecoderLayer(nn.Module):

    def __init__(self, d_model, ffn_hidden, n_head, drop_prob):
        super(DecoderLayer, self).__init__()
        self.self_attention = MultiHeadAttention(d_model=d_model, n_head=n_head)
        self.norm1 = LayerNorm(d_model=d_model)
        self.dropout1 = nn.Dropout(p=drop_prob)

        self.enc_dec_attention = MultiHeadAttention(d_model=d_model, n_head=n_head)
        self.norm2 = LayerNorm(d_model=d_model)
        self.dropout2 = nn.Dropout(p=drop_prob)

        self.ffn = PositionwiseFeedForward(d_model=d_model, hidden=ffn_hidden, drop_prob=drop_prob)
        self.norm3 = LayerNorm(d_model=d_model)
        self.dropout3 = nn.Dropout(p=drop_prob)

    def forward(self, dec, enc, trg_mask, src_mask):
        # 1. compute self attention
        _x = dec
        x = self.self_attention(q=dec, k=dec, v=dec, mask=trg_mask)

        # 2. add and norm
        x = self.dropout1(x)
        x = self.norm1(x + _x)

        if enc is not None:
            # 3. compute encoder - decoder attention
            _x = x
            x = self.enc_dec_attention(q=x, k=enc, v=enc, mask=src_mask)

            # 4. add and norm
            x = self.dropout2(x)
            x = self.norm2(x + _x)

        # 5. positionwise feed forward network
        _x = x
        x = self.ffn(x)

        # 6. add and norm
        x = self.dropout3(x)
        x = self.norm3(x + _x)
        return x
```
```python
class Decoder(nn.Module):
    def __init__(self, dec_voc_size, max_len, d_model, ffn_hidden, n_head, n_layers, drop_prob, device):
        super().__init__()
        self.emb = TransformerEmbedding(d_model=d_model,
                                        drop_prob=drop_prob,
                                        max_len=max_len,
                                        vocab_size=dec_voc_size,
                                        device=device)

        self.layers = nn.ModuleList([DecoderLayer(d_model=d_model,
                                                  ffn_hidden=ffn_hidden,
                                                  n_head=n_head,
                                                  drop_prob=drop_prob)
                                     for _ in range(n_layers)])

        self.linear = nn.Linear(d_model, dec_voc_size)

    def forward(self, trg, src, trg_mask, src_mask):
        trg = self.emb(trg)

        for layer in self.layers:
            trg = layer(trg, src, trg_mask, src_mask)

        # pass to LM head
        output = self.linear(trg)
        return output
```

## 2. Experiments
I use the Multi30K dataset to train and evaluate the model.
You can check the details of the dataset [here](https://arxiv.org/abs/1605.00459).
I follow the original paper's parameter settings (below).

![conf](image/transformer-model-size.jpg)

### 2.1 Model Specification
* total parameters = 55,207,087
* model size = 215.7MB
* lr scheduling : ReduceLROnPlateau

#### 2.1.1 configuration
* batch_size = 128
* max_len = 256
* d_model = 512
* n_layers = 6
* n_heads = 8
* ffn_hidden = 2048
* drop_prob = 0.1
* init_lr = 0.1
* factor = 0.9
* patience = 10
* warmup = 100
* adam_eps = 5e-9
* epoch = 1000
* clip = 1
* weight_decay = 5e-4

### 2.2 Training Result
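For reference, the optimizer-related values from the configuration list in 2.1.1 could be wired up roughly as follows. This is a sketch under assumptions: the use of `optim.Adam` is inferred from the `adam_eps` setting, the `nn.Linear` model is only a stand-in for the Transformer, and the constant validation loss is made up to trigger the scheduler. `warmup` and `epoch` are left out because this README does not say how they are applied.

```python
import torch
from torch import nn, optim

# values copied from the configuration list
init_lr, factor, patience = 0.1, 0.9, 10
adam_eps, weight_decay, clip = 5e-9, 5e-4, 1.0

model = nn.Linear(4, 4)  # stand-in for the Transformer (assumption)

optimizer = optim.Adam(model.parameters(), lr=init_lr,
                       weight_decay=weight_decay, eps=adam_eps)
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=factor,
                                                 patience=patience)

# one illustrative training step with gradient clipping
loss = model(torch.randn(2, 4)).pow(2).mean()
optimizer.zero_grad()
loss.backward()
nn.utils.clip_grad_norm_(model.parameters(), clip)
optimizer.step()

# the scheduler lowers the learning rate once the validation loss
# has failed to improve for more than `patience` consecutive checks
for _ in range(patience + 2):
    scheduler.step(1.0)  # a made-up validation loss that never improves

print(optimizer.param_groups[0]["lr"])  # roughly 0.09 (init_lr * factor)
```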
![image](saved/transformer-base/train_result.jpg)
* Minimum Training Loss = 2.852672759656864
* Minimum Validation Loss = 3.2048025131225586

| Model | Dataset | BLEU Score |
|:---:|:---:|:---:|
| Original Paper's | WMT14 EN-DE | 25.8 |
| My Implementation | Multi30K EN-DE | 26.4 |
## 3. Reference
- [Attention is All You Need, 2017 - Google](https://arxiv.org/abs/1706.03762)
- [The Illustrated Transformer - Jay Alammar](http://jalammar.github.io/illustrated-transformer/)
- [Data & Optimization Code Reference - Bentrevett](https://github.com/bentrevett/pytorch-seq2seq/)
## 4. Licence
Copyright 2019 Hyunwoong Ko.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.