Seq2Seq模型及注意力机制介绍

在文章循环神经网络RNN介绍中提到了一种非同步输出的many to many的结构，它输入的是一个序列，输出的也是一个序列，但两个序列的长度不一定相等，这种结构也叫做Seq2Seq模型。如下图所示，通常使用编码器-解码器 (Encoder-Decoder) 架构对其进行搭建：

编码器对输入进行编码，并将得到的最终状态输入到解码器中，由解码器进行解码获得对应的输出。在Seq2Seq模型中编码器和解码器通常选择RNN网络或者它的变体 (例如LSTM、GRU)。但是这种Seq2Seq模型还存在一个问题，以上图中的例子为例，它将“Hello world.”翻译为“Bonjour le monde.”，实际上是“Bonjour”对应“Hello”，“monde” 对应“world”，但是Seq2Seq模型只是将输入的句子转化为一个最终状态输入到解码器中，很难直接学习到这种对应关系。为解决这一问题，需引入注意力机制 (Attention Mechanism) 加入到Seq2Seq模型中。

注意力机制

注意力机制的目的是在生成输出序列的每一步时考虑整个输入序列，选出对该输出步影响最大的关键输入部分，而忽略其它不重要的输入部分。注意力机制的结构如左图所示，将注意力层的输入记为一个query(用 $\boldsymbol{q}$ 表示)，将输出output记为 $\boldsymbol{o}$ ，将嵌入注意力层的网络的记忆表示为 $n$ 个键-值(key-value)对，记为 $\boldsymbol{(k_1, v_1), (k_2, v_2), \cdots, (k_n, v_n)}$ 。

注意力层的具体计算方法如右图所示，对每个输入 $\boldsymbol{q}$ ，结合网络记忆 $\boldsymbol{k_i}$ 计算出权重系数 $\alpha_i$ ： $\large r_i = F(\boldsymbol{q}, \boldsymbol{k_i}),\space\space \alpha_i=\frac{\large e^{r_i}}{\large \sum_{j=1}^n e^{r_j}},\space\space i=1,2,\cdots,n$ 其中函数 $F$ 通常为一个浅层的神经网络。计算出权重系数后注意力层的输出可以写为 $\large \boldsymbol{o}=\sum_{i=1}^{n}\alpha_i \boldsymbol{v_i}$

接下来通过一个实例介绍在Seq2Seq模型中如何引入注意力机制，该实例的目的是将不同类型的英文日期表述转化为固定的格式"YYYY-MM-DD"，例如：

使用的数据集可以点击此链接下载，提取码为4y8l。首先进行数据读取和处理：

点击查看代码

import numpy as np
from keras.utils import to_categorical

with open('date.txt','r') as f:  
    dataset = eval(f.read())
### character dict for human date format    
human_vocab = {' ': 0, '.': 1, '/': 2, '0': 3, '1': 4, '2': 5, '3': 6, '4': 7, '5': 8, '6': 9, '7': 10, '8': 11, '9': 12,
               '<pad>': 36, '<unk>': 35, 'a': 13, 'b': 14, 'c': 15, 'd': 16, 'e': 17, 'f': 18, 'g': 19, 'h': 20, 'i': 21,
               'j': 22, 'l': 23, 'm': 24, 'n': 25, 'o': 26, 'p': 27, 'r': 28, 's': 29, 't': 30, 'u': 31, 'v': 32, 'w': 33, 'y': 34}
### character dict for machine date format YYYY-MM-DD
machine_vocab = {'-': 0, '0': 1, '1': 2, '2': 3, '3': 4, '4': 5, '5': 6, '6': 7, '7': 8, '8': 9, '9': 10}
inv_machine_vocab = {0: '-', 1: '0', 2: '1', 3: '2', 4: '3', 5: '4', 6: '5', 7: '6', 8: '7', 9: '8', 10: '9'}
### process data
def string_to_int(string, length, vocab, is_human):
    string = string.lower()
    string = string.replace(',','')
    if len(string) > length:
        string = string[:length] #truncate string if string is too long         
    # transform string to a list of integers. if character not in the vocab, replace it with <unk> 
    rep = list(map(lambda x: vocab.get(x, vocab['<unk>']), string)) if is_human else list(map(lambda x: vocab.get(x), string))
    # pad short string with <pad>
    if len(string) < length:
        rep += [vocab['<pad>']] * (length - len(string))
    return rep

def preprocess_data(dataset, human_vocab, machine_vocab, Tx, Ty):
    m = len(dataset)
    X, Y = zip(*dataset) #(m,), (m,)
    X = np.array([string_to_int(i, Tx, human_vocab, is_human=True) for i in X]) #(m, Tx)
    Y = np.array([string_to_int(t, Ty, machine_vocab, is_human=False) for t in Y]) #(m, Ty)
    Xoh = np.array(list(map(lambda x: to_categorical(x, num_classes=len(human_vocab)), X))) #(m, Tx, len(human_vocab))
    Yoh = np.array(list(map(lambda x: to_categorical(x, num_classes=len(machine_vocab)), Y))) #(m, Ty, len(machine_vocab))
    return X, Y, Xoh, Yoh

Tx, Ty = 30, 10
X, Y, Xoh, Yoh = preprocess_data(dataset, human_vocab, machine_vocab, Tx, Ty)

import numpy as np

from keras.utils import to_categorical

with open('date.txt','r') as f:

dataset = eval(f.read())

### character dict for human date format

human_vocab = {' ': 0, '.': 1, '/': 2, '0': 3, '1': 4, '2': 5, '3': 6, '4': 7, '5': 8, '6': 9, '7': 10, '8': 11, '9': 12,

'<pad>': 36, '<unk>': 35, 'a': 13, 'b': 14, 'c': 15, 'd': 16, 'e': 17, 'f': 18, 'g': 19, 'h': 20, 'i': 21,

'j': 22, 'l': 23, 'm': 24, 'n': 25, 'o': 26, 'p': 27, 'r': 28, 's': 29, 't': 30, 'u': 31, 'v': 32, 'w': 33, 'y': 34}

### character dict for machine date format YYYY-MM-DD

machine_vocab = {'-': 0, '0': 1, '1': 2, '2': 3, '3': 4, '4': 5, '5': 6, '6': 7, '7': 8, '8': 9, '9': 10}

inv_machine_vocab = {0: '-', 1: '0', 2: '1', 3: '2', 4: '3', 5: '4', 6: '5', 7: '6', 8: '7', 9: '8', 10: '9'}

### process data

def string_to_int(string, length, vocab, is_human):

string = string.lower()

string = string.replace(',','')

if len(string) > length:

string = string[:length] #truncate string if string is too long

# transform string to a list of integers. if character not in the vocab, replace it with <unk>

rep = list(map(lambda x: vocab.get(x, vocab['<unk>']), string)) if is_human else list(map(lambda x: vocab.get(x), string))

# pad short string with <pad>

if len(string) < length:

rep += [vocab['<pad>']] * (length - len(string))

return rep

def preprocess_data(dataset, human_vocab, machine_vocab, Tx, Ty):

m = len(dataset)

X, Y = zip(*dataset) #(m,), (m,)

X = np.array([string_to_int(i, Tx, human_vocab, is_human=True) for i in X]) #(m, Tx)

Y = np.array([string_to_int(t, Ty, machine_vocab, is_human=False) for t in Y]) #(m, Ty)

Xoh = np.array(list(map(lambda x: to_categorical(x, num_classes=len(human_vocab)), X))) #(m, Tx, len(human_vocab))

Yoh = np.array(list(map(lambda x: to_categorical(x, num_classes=len(machine_vocab)), Y))) #(m, Ty, len(machine_vocab))

return X, Y, Xoh, Yoh

Tx, Ty = 30, 10

X, Y, Xoh, Yoh = preprocess_data(dataset, human_vocab, machine_vocab, Tx, Ty)

该问题的模型架构如下图所示，它的编码器部分由双向(Bidirectional) LSTM网络组成，每步输出为 $\boldsymbol{a^{}}=\left[\boldsymbol{\overrightarrow{a}^{}}, \boldsymbol{\overleftarrow{a}^{}}\right]$ ，其中 $\boldsymbol{\overrightarrow{a}^{}}$ 为正向LSTM的输出， $\boldsymbol{\overleftarrow{a}^{}}$ 为反向LSTM的输出；解码器部分由LSTM网络组成，每步的输出为 $\boldsymbol{s^{}}$ ，并通过全连接层转化为最终的输出 $\boldsymbol{y^{}}$ 。此外，该架构中还引入了注意力机制，注意力层的输出为 $\boldsymbol{context^{}}$ 。一般来说，解码器每步也会有自己的输入 $\boldsymbol{d^{}}$ ，引入注意力机制后解码器的输入变为 $\left[\boldsymbol{d^{}}, \boldsymbol{context^{}}\right]$ ，这里为了简化模型在该问题中没有考虑 $\boldsymbol{d^{}}$ 。

模型整体架构的代码为：

点击查看代码

from keras.layers import Bidirectional, Input, LSTM, Dense
from keras.models import Model

### Defined shared layers as global variables
n_a = 32 # number of units for the pre-attention, bi-directional LSTM's hidden state 'a'
n_s = 64 # number of units for the post-attention LSTM's hidden state "s"
post_attention_LSTM_cell = LSTM(n_s, return_state = True) # post-attention LSTM 
output_layer = Dense(len(machine_vocab), activation="softmax")

### Model architecture
def model(Tx, Ty, n_a, n_s, human_vocab_size, machine_vocab_size):
    X = Input(shape=(Tx, human_vocab_size))
    s0 = Input(shape=(n_s,), name='s0') #define s0 (initial hidden state)
    c0 = Input(shape=(n_s,), name='c0') #define c0 (initial cell state)
    s = s0 
    c = c0
    outputs = [] # Initialize empty list of outputs, final shape: Ty np arrays with shape (m, len(machine_vocab))        
    # Step 1: Define pre-attention Bi-LSTM. 
    a = Bidirectional(LSTM(units=n_a, return_sequences=True))(X) #(m, Tx, 2*n_a)  
    # Step 2: Iterate for Ty steps
    for t in range(Ty):
        # Step 2.A: Perform one step of the attention mechanism to get the context vector at step t
        context = one_step_attention(a, s) #the attention layer is defined below, (m, 1, 2*n_a)
        # Step 2.B: Apply the post-attention LSTM cell to the context vector
        s, _, c = post_attention_LSTM_cell(context, initial_state=[s,c]) #(m, n_s), (m, n_s)
        # Step 2.C: Apply Dense layer to the hidden state output of the post-attention LSTM
        out = output_layer(s) #(m, len(machine_vocab))
        # Step 2.D: Append "out" to the "outputs" list 
        outputs.append(out)
    # Step 3: Create model instance
    model = Model(inputs=[X,s0,c0], outputs=outputs)    
    return model

from keras.layers import Bidirectional, Input, LSTM, Dense

from keras.models import Model

### Defined shared layers as global variables

n_a = 32 # number of units for the pre-attention, bi-directional LSTM's hidden state 'a'

n_s = 64 # number of units for the post-attention LSTM's hidden state "s"

post_attention_LSTM_cell = LSTM(n_s, return_state = True) # post-attention LSTM

output_layer = Dense(len(machine_vocab), activation="softmax")

### Model architecture

def model(Tx, Ty, n_a, n_s, human_vocab_size, machine_vocab_size):

X = Input(shape=(Tx, human_vocab_size))

s0 = Input(shape=(n_s,), name='s0') #define s0 (initial hidden state)

c0 = Input(shape=(n_s,), name='c0') #define c0 (initial cell state)

s = s0

c = c0

outputs = [] # Initialize empty list of outputs, final shape: Ty np arrays with shape (m, len(machine_vocab))

# Step 1: Define pre-attention Bi-LSTM.

a = Bidirectional(LSTM(units=n_a, return_sequences=True))(X) #(m, Tx, 2*n_a)

# Step 2: Iterate for Ty steps

for t in range(Ty):

# Step 2.A: Perform one step of the attention mechanism to get the context vector at step t

context = one_step_attention(a, s) #the attention layer is defined below, (m, 1, 2*n_a)

# Step 2.B: Apply the post-attention LSTM cell to the context vector

s, _, c = post_attention_LSTM_cell(context, initial_state=[s,c]) #(m, n_s), (m, n_s)

# Step 2.C: Apply Dense layer to the hidden state output of the post-attention LSTM

out = output_layer(s) #(m, len(machine_vocab))

# Step 2.D: Append "out" to the "outputs" list

outputs.append(out)

# Step 3: Create model instance

model = Model(inputs=[X,s0,c0], outputs=outputs)

return model

针对注意力层的计算过程如下图所示：

类比文章上一部分提到的注意力层的计算方法，此时网络记忆可写为 $\boldsymbol{k_{t^\prime}}=\boldsymbol{v_{t^\prime}}=\boldsymbol{a^{<t^\prime>}},\space\space t^\prime =1,2,\cdots,T_x$ 注意力层每步的输入为解码器中LSTM的前一步的输出，即 $\boldsymbol{q_t}=\boldsymbol{s^{<t-1>}}, \space\space t=1,2,\cdots,T_y$ 注意力层每步的输出 $\boldsymbol{o_t}=\boldsymbol{context^{<t>}}$ ，计算公式为 $\begin{cases}r^{<t,t^\prime>}=F(\boldsymbol{q_t}, \boldsymbol{k_{t^\prime}})=F(\boldsymbol{s^{<t-1>}}, \boldsymbol{a^{<t^\prime>}}), \space\space t^\prime = 1,2,\cdots,T_x \\ \alpha^{<t,t^\prime>}=\frac{\large e^{r^{<t,t^\prime>}}}{\large \sum_{j=1}^{T_x} e^{r^{<t,j>}}},\space\space t^\prime =1,2,\cdots,T_x \\ \boldsymbol{context^{<t>}} = \sum_{t^\prime =1}^{T_x}\alpha^{<t,t^\prime>}\boldsymbol{v_{t^\prime}} = \sum_{t^\prime =1}^{T_x}\alpha^{<t,t^\prime>}\boldsymbol{a^{<t^\prime>}}\end{cases}$ 其中 $F$ 为一个浅层的全连接神经网络。注意力层的代码为：

点击查看代码

from keras.layers import Concatenate, Dot, RepeatVector, Activation 
import keras.backend as K

### Define a custom softmax function for convenient use in attention layer
def softmax(x, axis=1):
    """
    Arguments:
    x -- Tensor.
    axis -- Integer, axis along which the softmax normalization is applied.
    """
    ndim = K.ndim(x)
    if ndim == 2:
        return K.softmax(x)
    elif ndim > 2:
        e = K.exp(x - K.max(x, axis=axis, keepdims=True))
        s = K.sum(e, axis=axis, keepdims=True)
        return e / s
    else:
        raise ValueError('Cannot apply softmax to a tensor that is 1D')
        
### Defined shared layers as global variables
repeator = RepeatVector(Tx)
concatenator = Concatenate(axis=-1)
# [densor1,densor2] is F
densor1 = Dense(10, activation = "tanh") 
densor2 = Dense(1, activation = "relu")
activator = Activation(softmax, name='attention_weights') #using a custom softmax(axis = 1)
dotor = Dot(axes = 1)

### Perform one step of attention
def one_step_attention(a, s_prev):
    """    
    Arguments:
    a -- hidden state output of the Bi-LSTM, numpy-array of shape (m, Tx, 2*n_a)
    s_prev -- previous hidden state of the (post-attention) LSTM, numpy-array of shape (m, n_s)
    """
    s_prev = repeator(s_prev) #(m, Tx, n_s)
    concat = concatenator([a, s_prev]) #(m, Tx, 2*n_a+n_s)
    e = densor1(concat) #(m, Tx, 10)
    energies = densor2(e) #(m, Tx, 1)
    alphas = activator(energies) # compute the attention weights, (m, Tx, 1)
    context = dotor([alphas, a]) #(m, 1, 2*n_a) 
    return context

from keras.layers import Concatenate, Dot, RepeatVector, Activation

import keras.backend as K

### Define a custom softmax function for convenient use in attention layer

def softmax(x, axis=1):

"""

Arguments:

x -- Tensor.

axis -- Integer, axis along which the softmax normalization is applied.

"""

ndim = K.ndim(x)

if ndim == 2:

return K.softmax(x)

elif ndim > 2:

e = K.exp(x - K.max(x, axis=axis, keepdims=True))

s = K.sum(e, axis=axis, keepdims=True)

return e / s

else:

raise ValueError('Cannot apply softmax to a tensor that is 1D')

### Defined shared layers as global variables

repeator = RepeatVector(Tx)

concatenator = Concatenate(axis=-1)

# [densor1,densor2] is F

densor1 = Dense(10, activation = "tanh")

densor2 = Dense(1, activation = "relu")

activator = Activation(softmax, name='attention_weights') #using a custom softmax(axis = 1)

dotor = Dot(axes = 1)

### Perform one step of attention

def one_step_attention(a, s_prev):

"""

Arguments:

a -- hidden state output of the Bi-LSTM, numpy-array of shape (m, Tx, 2*n_a)

s_prev -- previous hidden state of the (post-attention) LSTM, numpy-array of shape (m, n_s)

"""

s_prev = repeator(s_prev) #(m, Tx, n_s)

concat = concatenator([a, s_prev]) #(m, Tx, 2*n_a+n_s)

e = densor1(concat) #(m, Tx, 10)

energies = densor2(e) #(m, Tx, 1)

alphas = activator(energies) # compute the attention weights, (m, Tx, 1)

context = dotor([alphas, a]) #(m, 1, 2*n_a)

return context

模型训练和预测的代码为：

from keras.optimizers import Adam

### Train
model = model(Tx, Ty, n_a, n_s, len(human_vocab), len(machine_vocab))
model.summary()
opt = Adam(lr=0.005, beta_1=0.9, beta_2=0.999, decay=0.01)
model.compile(optimizer=opt, loss='categorical_crossentropy', metrics=['accuracy'])
s0 = np.zeros((m, n_s))
c0 = np.zeros((m, n_s))
outputs = list(Yoh.swapaxes(0,1)) #(m, Ty, len(machine_vocab)) --> Ty np arrays with shape (m, len(machine_vocab)) 
model.fit([Xoh, s0, c0], outputs, epochs=30, batch_size=100)

### Test
EXAMPLES = ['3 May 1979', '3rd of August 2016', 'Tue 16 Jul 2007', 'Saturday May 9 2018', 'March 3 2001', 
            'Jun 9th 2001', '09/19/1996']
for example in EXAMPLES:   
    source = string_to_int(example, Tx, human_vocab, is_human=True)
    source = np.array(list(map(lambda x: to_categorical(x, num_classes=len(human_vocab)), source)))[None,:,:] #(1, Tx, len(human_vocab))
    prediction = model.predict([source, s0, c0]) #a list of Ty np arrays with shape (1, len(machine_vocab))
    prediction = np.argmax(prediction, axis = -1) #(Ty, 1)
    output = [inv_machine_vocab[int(i)] for i in prediction]
    print("source:", example)
    print("output:", ''.join(output),"\n")

from keras.optimizers import Adam

### Train

model = model(Tx, Ty, n_a, n_s, len(human_vocab), len(machine_vocab))

model.summary()

opt = Adam(lr=0.005, beta_1=0.9, beta_2=0.999, decay=0.01)

model.compile(optimizer=opt, loss='categorical_crossentropy', metrics=['accuracy'])

s0 = np.zeros((m, n_s))

c0 = np.zeros((m, n_s))

outputs = list(Yoh.swapaxes(0,1)) #(m, Ty, len(machine_vocab)) --> Ty np arrays with shape (m, len(machine_vocab))

model.fit([Xoh, s0, c0], outputs, epochs=30, batch_size=100)

### Test

EXAMPLES = ['3 May 1979', '3rd of August 2016', 'Tue 16 Jul 2007', 'Saturday May 9 2018', 'March 3 2001',

'Jun 9th 2001', '09/19/1996']

for example in EXAMPLES:

source = string_to_int(example, Tx, human_vocab, is_human=True)

source = np.array(list(map(lambda x: to_categorical(x, num_classes=len(human_vocab)), source)))[None,:,:] #(1, Tx, len(human_vocab))

prediction = model.predict([source, s0, c0]) #a list of Ty np arrays with shape (1, len(machine_vocab))

prediction = np.argmax(prediction, axis = -1) #(Ty, 1)

output = [inv_machine_vocab[int(i)] for i in prediction]

print("source:", example)

print("output:", ''.join(output),"\n")

注意力机制

Leave a Comment Cancel Reply