Statistician's Deeplearning

Dynamic Few-Shot Visual Learning without Forgetting

Thu, 09 Aug 2018 17:00:00 +0900

Getting started

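In data analysis under a business setting, we always suffer from the imbalance problem. New products keep coming out, and people are eager to learn something about them before they have sold enough. With that kind of information, we could appeal new products to customers more effectively.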

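Few-shot learning is a very suitable approach for this kind of problem. If it works, that is. To tap into that possibility, I am reading Dynamic Few-Shot Visual Learning without Forgetting, and I am posting my notes to organize my thoughts along the way. Few-shot learning was a hot topic about two or three years ago and then went quiet for a while, but it has been drawing attention again since early this year. At the center of this are MAML and SNAIL, published by the BAIR group. If MAML (learning a good initialization) and SNAIL (learning over episodes) represent the two major streams of few-shot learning, the method proposed in this paper looks like a somewhat new attempt that differs from both.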

Few-shot learning

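The paper argues that existing methods, including transfer learning and various few-shot learning methods, have not properly solved the problems of learning speed and catastrophic forgetting. The paper sets out to solve them, and judging from the reported performance, it does considerably better than existing methods.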

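To summarize the algorithm briefly: the weights of the neural network hold the characteristics of each class, so when we need to classify into a novel category, we use the feature maps of a few samples to build a network that can classify into the new category. The simplest way to obtain the weight for a new category is as a function of the average of the feature map values, but the paper proposes using an attention mechanism (few-shot classification-weight generator based on attention). This process is said to resemble human learning, which builds on prior experience, but I am not sure how well directly adjusting weights like this would perform on classification problems in a completely different domain. It is probably more useful for extending an existing problem to one with more categories than for applying it to a brand-new problem.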

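In the end, once weights that can classify the novel categories have been generated, all the weights are treated as a single network, and base and novel categories are classified together as one problem. Weights obtained through different routes may have different scales. To combine them regardless of scale, the last layer of the network classifies by cosine similarity rather than by a plain inner product with the weights.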

Methodology

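First, the basic setup can be expressed in terms of the data as follows. The method has a two-stage training structure. In the first stage, a base category classifier that can discriminate the base categories well is trained on a large amount of data. The training data used for the base categories are denoted as follows.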

$D_{train} = \bigcup_{b=1}^{K_{base} } \left\{ x_{b,i} \right\}_{i=1}^{N_b}$

After the base classifier has been trained on the base categories, a second training stage is carried out for the novel categories, which have little data. The data used there are denoted as follows.

$D_{novel} = \bigcup_{n=1}^{K_{novel} } \left\{ x'_{n,i} \right\}_{i=1}^{N'_n}$

ConvNet-based recognition model

Classification is performed by a CNN-based neural network. It is no different from an ordinary network trained by backpropagation with the $K_{base}$ categories as labels, except for one thing: the last layer is built on cosine similarity. If the network has 5 layers, the paper calls the first 4 layers the feature extractor and the last layer the classifier. This is the usual terminology anyway, but the paper emphasizes it once more for clarity.

The feature extractor is expressed as a network $F(\cdot \vert \theta)$ with parameters $\theta$. The classifier is expressed as $C(\cdot \vert W^{*})$, whose full parameter set $W^{*} = \{ w_k^{*} \in \mathbb R^d \}_{k=1}^{K^{*}}$ contains the weight vectors $w_k^{*}$ of all $K_{base}$ base classes. The classifier therefore consists of $K^{*}$ classification vectors, and each classification vector is regarded as compressing the information about its class. Passing a feature through this classifier finally yields a score vector of length $K^{*}$, $p = C(z\vert W^{*})$.

Base categories and novel categories share the feature extractor. The outputs of the shared feature extractor are compared, and the weight vectors of similar classes are updated with the new data and used as the classifier for the novel categories.

In single-stage training, the focus is on finding a classifier that classifies the base categories well, in other words, on finding the optimal parameter set $W^*$.

As mentioned above, the classifier used in this paper differs from a conventional classifier in that it uses cosine similarity. Suppose the data have already passed through the feature extractor and produced a feature $z$ (more commonly called a feature map). Ordinarily, to compute $p=C(z\vert W^*)$, we first compute the scores $s_k, k = 1, \ldots, K_{base}$, and pass them through a softmax layer to get the final probability of each category. In the second training stage, which includes novel categories, the way the weight vectors are obtained differs greatly between base and novel categories. The base category weights are stable, having been learned slowly from a large amount of data, whereas the novel category weights produced by the weight generator are learned from very little data and are therefore less stable. So if the classifier simply multiplied the weight vectors by the feature vector $z$ and applied softmax, the predictions could collapse onto a single category. To guard against this, the paper normalizes both the weights and the features before taking the inner product, which amounts to computing cosine similarity.

This way, classification is not affected by the scale of weight vectors obtained from different pipelines. The authors say this is one of the paper's biggest contributions.

$s_k = \tau \cdot \cos(z, w_k^*) = \tau \cdot \bar z^T \bar w_k^*, \textrm{ where } \bar z = \frac z {\| z \|}, \ \bar w_k^* = \frac{w_k^*}{\|w_k^* \|}$

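Here $\tau$ is another parameter, a scalar. Only the direction is taken into account, so the score does not depend on the norm of the weight vector. Even without a ReLU on the last layer, non-linearity is not lost, and the features can take both positive and negative values. The paper says that this representation allows better classification.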
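The scale invariance described above can be checked with a small sketch in plain Python. The vectors and the value `tau=10.0` are made up for illustration and are not taken from the paper.

```python
import math

def l2_normalize(v):
    """Return v scaled to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_scores(z, weights, tau=10.0):
    """s_k = tau * cos(z, w_k) for every classification vector w_k."""
    z_bar = l2_normalize(z)
    return [tau * sum(a * b for a, b in zip(z_bar, l2_normalize(w)))
            for w in weights]

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

z = [1.0, 2.0, 0.5]                             # toy feature vector
W = [[1.0, 2.0, 0.0], [-1.0, 0.5, 2.0]]         # toy classification vectors
W_scaled = [[100.0 * x for x in W[0]], W[1]]    # first vector blown up 100x

scores = cosine_scores(z, W)
scores_scaled = cosine_scores(z, W_scaled)      # identical to `scores`
probs = softmax(scores)
```

With a plain dot product, the rescaled first vector would dominate the softmax. With cosine similarity, the scores stay the same.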

Another point

Besides allowing classification regardless of the scale of the weights, cosine similarity is also said to let the feature extractor generalize well whether the data come from base or novel categories. The reason given is that the extracted feature activations are pushed to match the weight vector of the ground truth label exactly, which helps reduce the intra-class variation of the $l_2$-normalized features. It is probably more accurate to say that it can be interpreted that way than to say that this is how it works.

In fact, it turns out that our feature extractor trained solely on cosine-similarity based classification of base categories, when used for image matching, it manages to surpass all prior state-of-the-art approaches on the fewshot object recognition task.

Few-shot classification weight generator

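Based on the result of the first training stage, we obtain the weights for the new categories, and in doing so we make use of the characteristics of the base categories. For the $K_{novel}$ newly added novel categories, there is a classification weight generator $G(\cdot,\cdot\vert \phi)$ parameterized by $\phi$. When the $i$-th example of the $n$-th novel category, $x'_{n,i}$, is fed into the feature extractor $F(\cdot \vert \theta)$, we obtain $z'_{n,i}$. In other words: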

$z'_{n, i} = F(x'_{n,i} \vert \theta)$

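Taking these $z'_{n,i}$ (collected as $Z'_n = \{z'_{n,i}\}_{i=1}^{N'_n}$) together with the base category weights as input, $G(\cdot,\cdot\vert \phi)$ produces the weight vector of the new category. Repeating this for all the new categories gives the following set of weight vectors for the new categories.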

$W_{novel} = \{ w'_n\}_{n=1}^{K_{novel}}, \textrm{ where, } w'_n = G(Z'_n, W_{base}\vert \phi)$

In the end, we have a set of weights that can classify the novel categories along with the base categories.

$C(\cdot \vert W^*), W^* = W_{base} \bigcup W_{novel}$

Since all the weights of the already trained base classifier are kept, the new categories can reportedly be classified without hurting accuracy on the existing categories.

Feature averaging based weight inference

I explained above that the reason for using a ConvNet based on cosine similarity is to make the feature vectors coming out of the feature extractor cluster by class. Based on this idea, the weight can be obtained from the average of the features of the $N'$ inputs that belong to the new category.

$w'_{avg} = \frac 1 {N'} \sum_{i=1}^{N'} \bar z'_i$

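The final weight is obtained as follows, where $\odot$ is the element-wise product and $\phi_{avg}$ is a learnable weight vector.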

$w' = \phi_{avg} \odot w'_{avg}$
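As a minimal sketch of the averaging-based inference above (plain Python, toy numbers; `phi_avg` is set by hand here, whereas in the paper it is learned):

```python
import math

def l2_normalize(v):
    """Return v scaled to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def average_weight(features, phi_avg):
    """w' = phi_avg (element-wise) mean_i(z_bar_i)."""
    normalized = [l2_normalize(z) for z in features]
    n, d = len(normalized), len(normalized[0])
    w_avg = [sum(z[j] for z in normalized) / n for j in range(d)]
    return [p * w for p, w in zip(phi_avg, w_avg)]

# Two toy features of one novel category (N' = 2, d = 2)
features = [[3.0, 4.0], [4.0, 3.0]]
w_plain = average_weight(features, phi_avg=[1.0, 1.0])    # [0.7, 0.7]
w_scaled = average_weight(features, phi_avg=[2.0, 0.5])   # [1.4, 0.35]
```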

Attention-based weight inference

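Attention is a concept that shows up in nearly every newly introduced neural network these days. The idea is to give more weight to the important elements instead of taking a simple average, and to let another neural network decide how much importance each element gets. Attention over the $K_{base}$ base categories is computed as follows.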

$w'_{att} = \frac 1 {N'} \sum_{i=1}^{N'} \sum_{b =1}^{K_{base}} Att(\phi_q \bar z'_i, k_b)\cdot \bar w_b, \textrm{ where }\phi_q \in \mathbb R^{d\times d}$

Here $\phi_q \in \mathbb R^{d\times d}$ is a matrix that turns the extracted feature into a query, and $\phi_q\bar z'_i$ is combined with the key vector $k_b$ to compute the attention. The attention is also computed with cosine similarity instead of the dot product.

Once this is computed, the weight is obtained as follows.

$w' = \phi_{avg} \odot w'_{avg} + \phi_{att} \odot w'_{att}$
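The attention-based term can be sketched in plain Python as below. This is an illustration under assumptions, not the authors' implementation: the attention kernel is taken to be a softmax over scaled cosine similarities, the scale `gamma` is a made-up value, and `keys` and `phi_q` are set by hand although they would be learned.

```python
import math

def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weight(features, phi_q, keys, base_weights, gamma=10.0):
    """w'_att: average over the N' features of an attention-weighted
    sum of the normalized base weight vectors."""
    d = len(base_weights[0])
    w_att = [0.0] * d
    for z in features:
        z_bar = l2_normalize(z)
        query = [dot(row, z_bar) for row in phi_q]    # phi_q @ z_bar
        # Assumed kernel: softmax over scaled cosine similarity to each key
        sims = [gamma * dot(l2_normalize(query), l2_normalize(k)) for k in keys]
        att = softmax(sims)
        for a, w in zip(att, base_weights):
            w_bar = l2_normalize(w)
            for j in range(d):
                w_att[j] += a * w_bar[j]
    return [x / len(features) for x in w_att]

base_weights = [[2.0, 0.0], [0.0, 3.0]]   # K_base = 2 toy base weight vectors
keys = [[1.0, 0.0], [0.0, 1.0]]           # one key per base category
phi_q = [[1.0, 0.0], [0.0, 1.0]]          # identity query transform
features = [[0.9, 0.1], [0.8, 0.3]]       # N' = 2 novel examples
w_att = attention_weight(features, phi_q, keys, base_weights)
```

Both toy features point toward the first base category, so `w_att` ends up close to the first normalized base weight vector.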

Training procedure

Train ConvNet

Training the ConvNet means training $F(\cdot\vert \theta)$ and $C(\cdot \vert W^*)$. To train the ConvNet and the few-shot classification weight generator $G(\cdot, \cdot \vert \phi)$, training is first carried out with the base categories only. This consists of two stages. In each stage, the following cross entropy loss is minimized.

$\frac 1 {K_{base}}\sum_{b=1}^{K_{base}} \frac 1 {N_b} \sum_{i=1}^{N_b} loss (x_{b,i}, b)$

1st training stage:

The parameters are set to $W^* = W_{base} = \{w_b\}_{b=1}^{K_{base}}$, and only the feature extractor and the classifier are trained.

2nd training stage:

The second stage trains the classification weight generator. Here one could either train with the parameters of the base classifier frozen or train all the parameters. The most important part is how the training set is constructed. The key is to train the classification weight generator by drawing fake examples from the base categories, matched to the small number of samples available for a novel category (the paper typically considers at most about 5 examples per novel category).

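No matter how many times I read it, the sentence in the paper seems odd to me. In any case, it says that when the attention mechanism is used, the base categories used by the attention mechanism should not be included in the training set. I will think through the exact meaning as I implement it.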
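One way to read the second stage is as episode sampling. The sketch below is my interpretation rather than the authors' code: a few base categories are set aside to play the role of "fake" novel categories, a handful of support examples is drawn for each, and the attention would then only see the remaining base categories. The function name, the toy data, and the counts are all hypothetical.

```python
import random

def sample_fake_novel_episode(data_by_class, k_novel, n_shot, rng):
    """Pick k_novel base classes to act as fake novel classes and draw
    n_shot support examples for each of them."""
    classes = sorted(data_by_class)
    fake_novel = rng.sample(classes, k_novel)
    # Base classes that stay available to the attention mechanism
    remaining_base = [c for c in classes if c not in fake_novel]
    support = {c: rng.sample(data_by_class[c], n_shot) for c in fake_novel}
    return fake_novel, remaining_base, support

rng = random.Random(0)
# Toy dataset: 10 base classes with 20 placeholder examples each
data_by_class = {c: [f"img_{c}_{i}" for i in range(20)] for c in range(10)}
fake_novel, remaining_base, support = sample_fake_novel_episode(
    data_by_class, k_novel=3, n_shot=5, rng=rng)
```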

Closing remarks

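For now, I have skimmed through the whole paper and sketched the big picture. I still think this model is far more intuitive to understand than MAML or SNAIL. Using cosine similarity at the end is also rather interesting. However, the implementation will probably not be that easy: the data have to go through different pipelines depending on the training phase, and, though perhaps not as much as with SNAIL, the structure of the training data has to be built carefully, so I expect that writing the data loader will not be easy. That wraps up this short but dense post. Next time I will cover how this can be implemented with gluon.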

Sentiment Analysis - Self attention based on Relation Network

Thu, 02 Aug 2018 17:00:00 +0900

Introduction

There are many methods for sentence representation. We have discussed 5 different ways of sentence representation based on token representation. Let’s briefly summarize what is dealt with in the previous posts.

What we have discussed so far…

Simply averaging the token embeddings in a sentence works pretty well on text classification problems. Text classification, which is a relatively easy and simple task, does not require understanding the meaning of the sentence semantically; it often suffices to count words. For example, for sentiment analysis, the algorithm needs to count words that have a significant relationship with positive or negative sentiment, regardless of their position and meaning. Of course, the algorithm should be able to learn the sentiment of each word itself.

RNN

For a better understanding of a sentence, the order of words should be given more weight. For this, an RNN can extract information from a series of input tokens into hidden states as below:

$H = (h_1, \ldots, h_T), \quad h_i \in \mathbb R^d$

When we use this information, we frequently use only the hidden state at the last time step. It is not easy to express all the information from the sentence in a single small vector.

CNN

Borrowing the idea from $n$-gram techniques, a CNN summarizes local information around the token of interest. For this we can apply 1D convolution, as depicted in the following figure. This is just an example and we can try other architectures too.

A 1D kernel of size 3 scans the tokens around the position we want to summarize information for. For this, we have to use padding of size 1 to keep the length of the feature map after filtering the same as the original length $T$. The number of output channels is $c_1$, by the way.
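The padding claim follows from the usual output-length formula for a convolution, which can be checked with a few lines of Python. The sequence length `T = 20` is an arbitrary value chosen for illustration.

```python
def conv1d_output_length(length, kernel_size, padding=0, stride=1):
    """Length of the feature map produced by a 1D convolution."""
    return (length + 2 * padding - kernel_size) // stride + 1

T = 20
same_length = conv1d_output_length(T, kernel_size=3, padding=1)  # 20
no_padding = conv1d_output_length(T, kernel_size=3)              # 18
```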

Another filter is applied to the feature map and the input is finally transformed into $c_2 \times T$. This series of steps mimics the way humans read a sentence: understanding the meaning of 3 tokens and then combining them to understand higher-level concepts. As a by-product, we can enjoy much faster computation using the well-optimized CNN algorithms implemented in deep learning frameworks.

Relation network

A pair of words may give us clearer information about the sentence. There are many cases where a word has a different meaning depending on its usage. For example, the word ‘like’ in ‘I like’ is different from that in ‘like this’. If we consider ‘I’ and ‘like’ together, we can be clearer about the sentiment of the sentence than when we consider ‘like’ and ‘this’ together: it is definitely a positive signal. The skip-gram is a technique to retrieve information from pairs of words. The pairs do not have to be adjacent; a gap between them is allowed, as the word ‘skip’ suggests.

As you can see from the above figure, a pair of tokens is fed into a function $f(\cdot)$ to extract the relation between them. For a fixed position $t$, the $T-1$ pairs are summarized, via sum, average, or any other relevant technique, for sentence representation.

A need for compromise

We can write down those three different approaches in a single general form as below:

$h_t = I_{t, 1}f(x_t, x_{1}) + \cdots + I_{t, (t-1)}f(x_t, x_{t-1}) + I_{t, (t+1)}f(x_t, x_{t+1}) + \cdots + I_{t, T}f(x_t, x_{T})$

Here $I_{t, t'}$ indicates whether the pair $(x_t, x_{t'})$ is used. With all $I_{t\cdot}$’s being 1, the general form says that all skip bigrams contribute evenly to the model.

In the case of RNN, we ignore any information after the token $x_t$, so the above equation reduces to

$h_t = f(x_t, x_{t-k}) + \cdots + f(x_t, x_{t-1}).$

With a bidirectional RNN, we can also consider the backward relation from $x_T$ to $x_t$, though.

On the other hand, a CNN looks at information only around the token of interest. If we only care about $k$ tokens before and after token $x_t$, the general formula can be rearranged as below:

$h_t = f(x_t, x_{t-k}) + \cdots + f(x_t, x_{t-1}) + f(x_t, x_{t+1}) + \cdots + f(x_t, x_{t+ k})$

While a relation network can be too big, since it considers all pairwise relationships of tokens, a CNN can be too small, since it considers only local relationships between them. We need a compromise between those two extremes, and that is the so-called attention mechanism.
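To make the general form concrete, here is a small sketch in plain Python in which the three approaches differ only in the indicator mask $I_{t,t'}$. Tokens are plain numbers and `pair_feature` is a toy stand-in for $f$; both are assumptions made for illustration.

```python
def pair_mask(T, mode, k=1):
    """mask[t][s] = 1 if the pair (x_t, x_s) contributes to h_t."""
    mask = [[0] * T for _ in range(T)]
    for t in range(T):
        for s in range(T):
            if s == t:
                continue
            if mode == "relation":      # every other token
                keep = True
            elif mode == "rnn":         # only the k previous tokens
                keep = t - k <= s < t
            elif mode == "cnn":         # k tokens on each side
                keep = abs(t - s) <= k
            else:
                raise ValueError(mode)
            mask[t][s] = 1 if keep else 0
    return mask

def pair_feature(a, b):
    """Toy stand-in for f(x_t, x_s)."""
    return a * b

def summarize(tokens, mask):
    """h_t = sum_s I[t][s] * f(x_t, x_s)."""
    T = len(tokens)
    return [sum(mask[t][s] * pair_feature(tokens[t], tokens[s]) for s in range(T))
            for t in range(T)]

tokens = [1.0, 2.0, 3.0, 4.0, 5.0]
h_relation = summarize(tokens, pair_mask(5, "relation"))
h_rnn = summarize(tokens, pair_mask(5, "rnn", k=2))
h_cnn = summarize(tokens, pair_mask(5, "cnn", k=1))
# For the middle token (value 3.0): 36.0, 9.0 and 18.0 respectively
```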

Self-Attention

A general form given in the previous paragraph can be re-written in a more flexible form as follows:

$h_t = \sum_{t' = 1}^T \alpha(x_t, x_{t'}) f(x_t, x_{t'})$

Here, $\alpha(\cdot,\cdot)$ controls how much effect each pairwise combination of tokens has. For example, the two tokens ‘I’ and ‘you’ in the sentence ‘I like you like this’ may not contribute to the decision on its sentiment. In contrast, the combination of ‘I’ and ‘like’ gives us a clear idea about the sentiment of the sentence. In this case we pay little attention to the former and significant attention to the latter. By introducing the weight function $\alpha(\cdot, \cdot)$, we let the algorithm adjust the importance of each word combination.

Supposing that the $T$ tokens in the $i$-th sentence are embedded in $H_{i1}, \ldots, H_{iT}$, each token embedding is assigned a weight $\alpha_{it}$, which represents its relative importance when the tokens are summarized into a single representation. For this attention vector to express the relative importance of word combinations, the attention weights must satisfy

$\sum_{t = 1} ^T \alpha_{i, t} = 1$

and this property is achieved by inserting a softmax layer as a node in the network.
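As a minimal sketch of that constraint in plain Python: raw scores go through a softmax so that the weights sum to 1, and the token embeddings are then combined with those weights. The embeddings and scores are toy values.

```python
import math

def softmax(scores):
    """Numerically stable softmax; the outputs are positive and sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(token_embeddings, raw_scores):
    """Weighted sum of token embeddings with softmax attention weights."""
    alpha = softmax(raw_scores)
    d = len(token_embeddings[0])
    summary = [sum(a * h[j] for a, h in zip(alpha, token_embeddings))
               for j in range(d)]
    return alpha, summary

H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]              # T = 3 tokens, d = 2
alpha_flat, summary_flat = attend(H, [0.0, 0.0, 0.0])  # uniform weights
alpha_peak, summary_peak = attend(H, [5.0, 0.0, 0.0])  # first token dominates
```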

The final product we want to have at the end of the day is a weight matrix per input sentence. If we feed 10 sentences into the network, we will get 10 attention matrices that look like this.

Self-Attention implementation

The self-attention mechanism was first proposed in the paper A Structured Self-Attentive Sentence Embedding, which applied a self-attention mechanism to the hidden layer of a bidirectional LSTM, as shown in the following figure.

The token representation, however, does not have to come from an LSTM (strictly speaking it is not a token representation; what I mean is the stage before the sentence representation), and we will apply the self-attention mechanism to a token representation based on a relation network.

Unlike the self-attention mechanism from the original paper (given in the above figure; mathematical details can be found in my previous post, here), the attention mechanism for a relation network can be defined as

To explain the above diagram, let’s assume that we want to get a representation for the $i$-th token. For each combination of a token with the $i$-th token, two outputs are produced: one is used for feature extraction (green circle) and the other is used for the attention weight (red circle). Those two outputs may share a network, but in this article we use a separate network for each output. The output for the attention (red circle) runs through a sigmoid and a softmax layer before we get the final attention weights. These attention weights are multiplied with the extracted features to get the representation for the token of interest.

Self-Attention with Gluon

For the implementation, we assume a very simple network with two fully connected dense layers for the relation extractor and one dense layer for attention, followed by another two fully connected dense layers for the classifier. The relation extractor and attention extractor are given in the following code snippet.

from mxnet import nd
from mxnet.gluon import nn

class Sentence_Representation(nn.Block):
    def __init__(self, **kwargs):
        super(Sentence_Representation, self).__init__()
        # Expected kwargs: vocab_size, emb_dim, hidden_dim, sentence_length
        for (k, v) in kwargs.items():
            setattr(self, k, v)
        
        with self.name_scope():
            self.embed = nn.Embedding(self.vocab_size, self.emb_dim)
            self.g_fc1 = nn.Dense(self.hidden_dim,activation='relu')
            self.g_fc2 = nn.Dense(self.hidden_dim,activation='relu')
            self.attn = nn.Dense(1, activation = 'tanh')
            
    def forward(self, x):
        embeds = self.embed(x) # batch * time step * embedding
        # Build every pair of tokens by repeating along two different axes
        x_i = embeds.expand_dims(1)
        x_i = nd.repeat(x_i,repeats= self.sentence_length, axis=1) # batch * time step * time step * embedding
        x_j = embeds.expand_dims(2)
        x_j = nd.repeat(x_j,repeats= self.sentence_length, axis=2) # batch * time step * time step * embedding
        x_full = nd.concat(x_i,x_j,dim=3) # batch * time step * time step * (2 * embedding)
        # New input data: one row per token pair
        _x = x_full.reshape((-1, 2 * self.emb_dim))
        
        # Network for attention: sigmoid, then softmax over axis 1
        _attn = self.attn(_x)
        _att = _attn.reshape((-1, self.sentence_length, self.sentence_length))
        _att = nd.sigmoid(_att)
        att = nd.softmax(_att, axis = 1)
        
        # Network for feature extraction
        _x = self.g_fc1(_x) # (batch * time step * time step) * hidden_dim
        _x = self.g_fc2(_x) # (batch * time step * time step) * hidden_dim
        # add all (sentence_length*sentence_length) sized result to produce sentence representation

        x_g = _x.reshape((-1, self.sentence_length, self.sentence_length, self.hidden_dim))
    
        # Use the softmax-normalized weights (att), as described in the text,
        # and inflate them to the shape of the extracted features
        _inflated_att = att.expand_dims(axis = -1)
        _inflated_att = nd.repeat(_inflated_att, repeats = self.hidden_dim, axis = 3)

        x_q = nd.multiply(_inflated_att, x_g)

        sentence_rep = nd.mean(x_q.reshape(shape = (-1, self.sentence_length **2, self.hidden_dim)), axis= 1)
        return sentence_rep, att

We have separate networks for feature extraction and attention. The resulting attention vector is of size $T\times 1$ and the resulting feature extraction is of size $T\times d$, where $d$ is a hyperparameter. To multiply the two, we simply inflate the attention vector to match the size of the feature extraction. It’s just a trick and other implementations could be better. The entire implementation can be found here

Result

Here are the attention matrices for 9 randomly selected sentences.

We can see which tokens the algorithm pays attention to when it classifies the text. As expected, sentiment words such as ‘love’, ‘awesome’, ‘stupid’, ‘suck’ got some spotlight during the classification process.

WGAN and WGAN-GP

Thu, 26 Jul 2018 17:00:00 +0900

Introduction

It has been a while since I posted articles about GAN and WGAN. I want to close this series of posts on GAN with this post presenting gluon code for GAN using MNIST.

GAN is notorious for its instability during training. There are two main streams of research addressing this issue: one is to figure out an optimal architecture for stable learning, and the other is to fix the loss function, which is considered one of the primary reasons for the instability. WGAN belongs to the latter group, defining a new loss function based on a different distance measure between two distributions, called Wasserstein distance. Earlier, the original vanilla GAN paper proved that, up to constants, the adversarial loss is equivalent to the Jensen-Shannon divergence at the optimal discriminator. For more detailed information about GAN, please refer to Introduction to WGAN

Wasserstein distance

A crucial disadvantage of metrics based on KL (Kullback-Leibler) divergence (Jensen-Shannon divergence is just a symmetrized version of KL divergence) is that they are informative only for distributions that share the same support. If that is not the case, those metrics explode or become constant, so they cannot represent the right distance between distributions. The WGAN paper has a nice illustration of this, and if you need a more detailed explanation, you can read this post.

This was not a big problem in classification tasks, since an entropy-based metric for a categorical response has a limited number of categories, and the ground-truth distribution and its estimator must share the support. It is a totally different story for generation problems, since we need to generate a small manifold in a huge original data space. Needless to say, it must be very hard for a set of tiny manifolds to share their support. Let’s think about MNIST. Assuming grayscale images, each image is a point in a 784-dimensional space with $256^{784}$ possible values, but the size of the collected data at hand is 60,000. I cannot tell precisely, but meaningful images that look like hand-written numbers are rare in the entire space of 28 $\times$ 28 images.

Wasserstein distance can measure how far apart distributions are even when they do not share supports. This is a very good thing, but Wasserstein distance is not easy to calculate, since it involves another optimization problem in itself. Kantorovich-Rubinstein duality turns the original optimization problem into a much simpler maximization problem under a certain constraint. The main idea of WGAN is that a neural network can be used to approximate Wasserstein distance accurately. Here is the resulting form of Wasserstein distance, where $f_w$ is a function parameterized by $w$ and constrained to be Lipschitz.

$\sup_{w} \; E_{X\sim P_r}[f_w(X)] - E_{X \sim P_\theta}[f_w(X)]$

To give a better sense of it, I have depicted below what the above equation means.

The function $f$ in the above figure is just an example. What we are going to do is search for the function that maximizes the difference of expectations among all possible $K$-Lipschitz functions. That is a very extensive search, and the authors of WGAN suggested letting a neural network take care of it.

In the sense that WGAN gives us a real-valued distance between the distributions of real and generated data, WGAN can be thought of as a more flexible version of GAN, which just says yes or no to the question “Are the two distributions the same?”.
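The difference between the two kinds of metric shows up already in one dimension. The sketch below uses toy data and relies on the fact that, for two equal-size samples on the real line, the Wasserstein-1 distance is the mean absolute difference of the sorted samples. The KL divergence between distributions with disjoint support is infinite wherever the generated distribution sits, while the Wasserstein distance shrinks as it moves closer to the real one.

```python
import math

def wasserstein_1d(xs, ys):
    """W1 between two equal-size empirical distributions on the real line."""
    assert len(xs) == len(ys)
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions over the same bins."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue
        if qi == 0.0:
            return math.inf    # p puts mass where q has none
        total += pi * math.log(pi / qi)
    return total

real = [0.0, 0.0, 0.0, 0.0]                  # all real samples sit at 0
w_far = wasserstein_1d(real, [3.0] * 4)      # generated samples at 3
w_near = wasserstein_1d(real, [0.5] * 4)     # generated samples at 0.5
kl_disjoint = kl_divergence([1.0, 0.0], [0.0, 1.0])  # no shared support
```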

Critic vs Discriminator

WGAN introduces a new concept, the critic, which corresponds to the discriminator in GAN. As briefly mentioned above, the discriminator in GAN only tells whether incoming data are fake or real, and it evolves over the epochs to increase its accuracy in making such decisions. In contrast, the critic in WGAN tries to measure the Wasserstein distance better by approximating the optimal Lipschitz function more tightly. The approximation is done by updating the critic network under the constraint that the critic network satisfies the Lipschitz continuity condition.

If you look at the final algorithms, GAN and WGAN look very similar to each other from an algorithmic point of view, but their derivation and role are quite different, as much as a variational autoencoder is different from an autoencoder. One fascinating thing is that the derived loss function is even simpler than the loss function of the original GAN algorithm. It’s just the difference between two averages. The following equation is taken from the algorithm in the original WGAN paper.

Critic implementation

The entire algorithm is given below. The critic implementation in particular is highlighted with a pink box. When a set of data is given, the algorithm first compares it with a set of generated images. To get a more accurate distance, the critic network is updated for several steps so that it ends up with the maximum difference of expectations between real and fake data, which is the Wasserstein distance. It may fail to find the exact distance, but we want to be as close as possible.

The relevant part of the implementation looks like this. (It’s gluon)

from mxnet import nd, autograd

for j in range(n_critic_steps):
    latent_z = nd.random.normal(shape =(batch_size, latent_dim), ctx = ctx) # Latent data
    fake_data = gen(latent_z) # generated outside record(): the generator is not updated here
    with autograd.record():
        # The critic forward passes must be recorded so that backward() reaches its weights
        c_real = nd.mean(critic(real_data))
        c_fake = nd.mean(critic(fake_data))
        c_cost = - (c_real - c_fake)
    c_cost.backward()
    critic_trainer.step(batch_size)
    # Weight clipping
    for param in critic.collect_params().values():
        param.set_data(nd.clip(param.data(), -clip, clip))

According to the definition of Wasserstein distance, we need to maximize the difference between the expectations under two different distributions. To utilize the built-in optimizers in Gluon, which minimize, we define the cost as the negative of the value we want to maximize.

At the end of each critic update step, the weights are clipped so that the function represented by the critic network does not violate the Lipschitz continuity condition. The authors didn’t like this heuristic approach, though.

Since the first part of Wasserstein distance does not involve generator network’s parameter $\theta$, we can ignore the first part of Wasserstein distance.

Only considering the latter part, we can update the generator network as follows:

latent_z = nd.random.normal(shape = (batch_size, latent_dim), ctx =ctx)
with autograd.record():
    fake = gen(latent_z)
    g = nd.mean(critic(fake))
    g_cost = - g
    g_cost.backward()
gen_trainer.step(batch_size) 

The entire code can be found in git repository

Penalization

In the original WGAN, the Lipschitz constraint was imposed using weight clipping, and there was obvious room for improvement. Instead, the authors of Improved Training of Wasserstein GANs proposed to impose a penalty on the norm of the gradient of the critic with respect to its input. It is a natural way to encourage the critic network to satisfy the Lipschitz condition, since a differentiable 1-Lipschitz function has a gradient norm of at most 1. The following code shows the “penalty part” of the entire implementation.

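def get_penalty(critic, real_data, fake_data, _lambda):
    from mxnet import nd, autograd
    batch_size = real_data.shape[0]
    ctx = real_data.context
    # One interpolation coefficient per sample, repeated across the features
    alpha = nd.random.uniform(shape = (batch_size,1), ctx = ctx)
    alpha = nd.repeat(alpha, repeats = real_data.shape[1], axis = 1)

    # Random points on the lines between real and fake samples
    interpolates = alpha * real_data + ((1 - alpha) * fake_data)
    interpolates.attach_grad()
    with autograd.record():
        z = critic(interpolates)
    z.backward()

    gradients = interpolates.grad
    # WGAN-GP penalizes (||gradient|| - 1)^2 for each sample.
    # NOTE: the value is built from Python scalars, so as written it
    # carries no gradient back to the critic weights.
    gradient_penalty = nd.mean(nd.array([(x.norm().asscalar() - 1) ** 2 for x in gradients], ctx = ctx)) * _lambda
    
    return gradient_penalty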

The rest of the algorithm is exactly the same as that of WGAN.
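To see what the penalty measures, here is a framework-free sketch in plain Python. It assumes a hypothetical linear critic $f(x) = w^T x$, whose gradient with respect to the input is simply $w$, so no automatic differentiation is needed. The names and numbers are made up for illustration.

```python
import math
import random

def gradient_penalty(grad_fn, real_batch, fake_batch, lam, rng):
    """lam * mean over samples of (||grad f(x_hat)|| - 1)^2, where x_hat is
    a random point on the line between a real and a fake sample."""
    penalties = []
    for real, fake in zip(real_batch, fake_batch):
        alpha = rng.random()
        x_hat = [alpha * r + (1 - alpha) * f for r, f in zip(real, fake)]
        grad = grad_fn(x_hat)
        norm = math.sqrt(sum(g * g for g in grad))
        penalties.append((norm - 1.0) ** 2)
    return lam * sum(penalties) / len(penalties)

def linear_critic_grad(w):
    """Input gradient of the hypothetical critic f(x) = w . x (constant in x)."""
    def grad_fn(x):
        return list(w)
    return grad_fn

rng = random.Random(0)
real_batch = [[1.0, 2.0], [0.5, -1.0]]
fake_batch = [[0.0, 0.0], [2.0, 2.0]]

# Gradient norm 1 everywhere: no penalty
gp_unit = gradient_penalty(linear_critic_grad([0.6, 0.8]), real_batch, fake_batch, 10.0, rng)
# Gradient norm 5 everywhere: penalty 10 * (5 - 1)^2 = 160
gp_steep = gradient_penalty(linear_critic_grad([3.0, 4.0]), real_batch, fake_batch, 10.0, rng)
```

A critic whose input gradient has norm 1 is not penalized, while a steeper critic is penalized in proportion to the squared excess.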

Results and thoughts

After 400 epochs, I printed the generated images. Even after 400 epochs, I could not yet get perfect hand-written number images.

In my experience, the two algorithms seem to be comparable. My personal feeling is that it’s still very hard to generate images even with WGAN and improved WGAN.

Sentiment Analysis - Convolutional Neural Network

Wed, 18 Jul 2018 17:00:00 +0900

Introduction

Let’s think about the way humans understand a sentence. We read the sentence from left to right (this is not the case in some ancient Asian cultures, though), word by word, memorizing the meaning of the words first. Words themselves may have very different meanings depending on where they are placed or how they are used. To understand the real meaning of the words, we break the sentence down into smaller phrases, groups of words, to get the right meaning of the words in the context of the sentence. Lastly, we weave together the meanings of the phrases to finally understand the sentence. How can we mimic this reading behavior?

Recurrent network is not enough

A recurrent neural network models human reading behavior by taking the sentence as a sequence of words in order (token may be a better expression, but here we stick with word). It calculates a conditional probability given the previously read words. In particular, an LSTM can adjust the amount of memory it keeps for each word to get the best understanding of the sentence. An RNN can also be used to model a hierarchical way of understanding a sentence (word - phrase - sentence - paragraph structure) by stacking layers.

A bidirectional RNN can be another option for better understanding the sentence. From time to time, a word at the end of the sentence can be helpful in understanding words located in the earlier part of the sentence. A bidirectional RNN allows memory cells to collect information from the back to the front of the sentence. By concatenating the RNN cells from both the forward and backward directions, the meaning of words gets clearer than with a single RNN cell.

One of the biggest issues with RNNs is speed and parallelization. The sequential nature of an RNN prevents it from being parallelized, and it ends up with a very slow training speed. To make it even worse, memory cells have many parameters compared to a CNN, which is another source of inefficiency.

CNN can do something about it.

CNN is well-known for picking up spatial information and is widely used for image-related tasks. Understanding a sentence in a hierarchical manner can be considered a process of recognizing low-level local features and abstracting them into higher-level concepts. So why not use a CNN for sentence representation?

Additionally, as a CNN uses only the words around the word that the algorithm is focusing on, we can easily break the input down into pieces and process those pieces in parallel. So Kim (2014) proposed a simple algorithm that employs a CNN for sentiment analysis. Let’s go through some of its details.

CNN architecture for sentiment analysis.

In this article, we will implement Kim (2014), not exactly but very similarly, keeping the idea.

The following visual came from the paper and if you understand this clearly, I think you are almost there.

NOTE: In my personal experience, most papers are not kind enough to tell every detail of their idea, and it is very hard to implement the idea correctly without those implicit elements that can hardly be found in the original paper. This paper, however, seems to be relatively straightforward to implement.

A fancier plot of the same idea can be found below.

It consists of a couple of 1D convolution layers with different kernel sizes applied to the word embeddings. By doing this, we can retrieve information from word groups of various sizes.

The feature maps obtained by applying the 1D convolution layers sequentially from the start to the end of the sentence are fed into a max-pooling layer, which summarizes the $N - k + 1$ values of each feature map into a single number. Here $N$ is the number of words in the sentence and $k$ is the size of the 1D convolution filter. Concatenating the outputs of the max-pooling layers (each just a scalar), we get a vector as long as the total number of filters, and it becomes the input for a classifier built with fully connected layers.

Before diving into the details of the gluon implementation, let’s consider the dimensionality of the embedding and the feature maps. After the data go through the embedding layer, for each sentence we have a two-dimensional matrix of size $N \times e$, where $N$ is the number of words in the sentence (the same as defined above) and $e$ is the dimensionality each word is embedded into. That means each row is an embedded word and the matrix has as many rows as there are words. In this implementation, we do not apply a 1D convolution layer to this matrix directly. Even though it is a 1D convolution that we need, we use 2D convolutional layers with the kernel size chosen so that they act as 1D convolutional layers.

For this, if we set the width of the kernel to the embedding size, then there is no room for the 2D convolution layer to slide across the embedding dimension, and the kernels are applied only along the direction of words.
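The effect of a kernel that spans the whole embedding width can be sketched in plain Python, independently of gluon. The $5 \times 2$ embedding matrix and the $3 \times 2$ kernel below are toy values. The kernel can only slide along the word axis, which gives $N - k + 1$ outputs, and max-pooling then reduces them to one number.

```python
def conv_over_words(embeddings, kernel):
    """Slide a (k x e) kernel down an (N x e) matrix: one ReLU output
    per window of k consecutive words."""
    n, k = len(embeddings), len(kernel)
    outputs = []
    for start in range(n - k + 1):
        window = embeddings[start:start + k]
        total = sum(w * x
                    for k_row, e_row in zip(kernel, window)
                    for w, x in zip(k_row, e_row))
        outputs.append(max(total, 0.0))   # ReLU activation
    return outputs

# N = 5 words embedded in e = 2 dimensions
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0], [0.0, -1.0]]
kernel = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]   # k = 3, full embedding width

feature_map = conv_over_words(embeddings, kernel)   # [3.0, 3.0, 1.0]
pooled = max(feature_map)                           # max-over-time pooling
```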

Here is how the convolution layers are defined. Only the relevant part of the code is displayed below; the working code is given in the gluon implementation

class Sentence_Representation(nn.Block):
    def __init__(self, **kwargs):
        super(Sentence_Representation, self).__init__()
        for (k, v) in kwargs.items():
            setattr(self, k, v)
        with self.name_scope():
            self.embed = nn.Embedding(self.vocab_size, self.emb_dim)
            self.conv1 = nn.Conv2D(channels = 8, kernel_size = (3, self.emb_dim), activation = 'relu')
...

In this article, we used 4 convolution layers with kernel sizes 3, 4, 5, and 6. Each layer has 8 output channels, so we have 8 $\times$ 4 $=$ 32 nodes as input for the classifier.

    def forward(self, x):
        embeds = self.embed(x) # batch * time step * embedding
        embeds = embeds.expand_dims(axis = 1)
        _x1 = self.conv1(embeds)
        _x1 = self.maxpool1(_x1)
        _x1 = nd.reshape(_x1, shape = (-1, 8))
        
        _x2 = self.conv2(embeds)
        _x2 = self.maxpool2(_x2)
        _x2 = nd.reshape(_x2, shape = (-1, 8))
        
        _x3 = self.conv3(embeds)
        _x3 = self.maxpool3(_x3)
        _x3 = nd.reshape(_x3, shape = (-1, 8))
        
        _x4 = self.conv4(embeds)
        _x4 = self.maxpool4(_x4)
        _x4 = nd.reshape(_x4, shape = (-1, 8))

        _x = nd.concat(_x1, _x2, _x3, _x4)

The dimensionality of the embedding is $B\times N \times e$, where $B$ is the batch size. Before feeding the word embeddings to the convolution layers, we have to expand their dimensions, since a 2D conv layer takes a 4-dimensional array as its input: $B\times C \times H \times W$ (batch size $\times$ channel $\times$ height $\times$ width). As described above, we treat $N$ as $H$ and $e$ as $W$. Text has no channel dimension, so we add an input channel of size 1 by expanding the dimensions at axis 1.
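The shape bookkeeping can be checked with a small sketch. The helper `conv2d_out_shape` and the toy sizes are assumptions for illustration, and since the pooling layers are elided in the excerpt, the sketch assumes each one takes the maximum over all remaining positions.

```python
def conv2d_out_shape(in_shape, channels, kernel):
    """Output shape of a stride-1, unpadded 2D convolution on a (B, C, H, W) input."""
    b, _, h, w = in_shape
    kh, kw = kernel
    return (b, channels, h - kh + 1, w - kw + 1)

B, N, e = 16, 30, 50                      # toy batch size, sentence length, embedding size
embeds = (B, N, e)
expanded = (embeds[0], 1) + embeds[1:]    # expand_dims(axis = 1) -> (B, 1, N, e)

branch_widths = []
for k in (3, 4, 5, 6):
    conv = conv2d_out_shape(expanded, channels=8, kernel=(k, e))  # (B, 8, N - k + 1, 1)
    pooled = conv[:2] + (1, 1)            # assumed: max over the N - k + 1 positions
    branch_widths.append(pooled[1])       # reshape to (B, 8)

classifier_input = (B, sum(branch_widths))   # concatenation of the four branches
print(expanded, classifier_input)
```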

Result

It still shows excellent performance, with an accuracy of 0.99. Only 35 sentences are misclassified, and some of them look as follows:

that not even an exaggeration and at midnight go to wal mart to buy the da vinci code which be --- Label:1.0
ok time to update wow have update for a long time ok so yeah watch over the hedge and mission --- Label:1.0
hey friends know many of be wonder where have be well last week go to a special screening of mission --- Label:1.0
hate though because really like mission impossible film so feel bad when go see in theater since put money in --- Label:1.0
mission impossible do kick ass and yes jessica be pretty damn dumb --- Label:1.0
harry potter and the philosopher stone rowling strangely a fan of hp fanfic but not of the book --- Label:1.0
child text fantasy perhaps most obviously be often criticize for oversimplify the struggle of good evil harry potter may be --- Label:1.0
for that but since be fault be into harry potter --- Label:1.0
harry potter be good but there only of those --- Label:1.0
also adore harry potter and hogwarts --- Label:1.0
well at least harry potter be real good so far --- Label:1.0
decide how want the harry potter to go --- Label:1.0
harry potter the goblet of fire be good folow on from the other movie --- Label:1.0
keep gettt into little want harry potter fit and have to watch which also can wait till in out in --- Label:1.0
harry potter be a story of good conquer evil and the friendship that be form along the way --- Label:1.0
and snuck in and go to springer brokeback mountain be finally see --- Label:1.0
think people be more tired of the mission impossible franchise than tom cruise --- Label:0.0
so run off because hated top gun mission impossible cocktail --- Label:0.0

I will leave it to the readers of this article to judge whether those sentences deserve to be misclassified.

Reference

https://arxiv.org/pdf/1510.03820.pdf
http://www.aclweb.org/anthology/D14-1181
http://docs.likejazz.com/cnn-text-classification-tf/#fn:fn-3

Sentiment Analysis- Self Attention

Thu, 12 Jul 2018 17:00:00 +0900

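Getting started

An LSTM lets us extract several kinds of features from a sentence. The previous posts mostly wrote code that represents a sentence with the hidden state, but besides the hidden state we could also summarize a sentence with the output of each time step. The per-step outputs really show their worth in seq2seq problems, though: feeding the output of one time step in as the input of the next is what makes it possible to generate a sentence sequentially. I plan to look at that later in the context of NMT.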

Self Attention

What I regretted most while building the various models so far was that I could not check what mechanism actually makes them work. On top of that, every model performed so well on the data used so far that it was hard to tell how any one of them behaves. There is a good tool for this problem. Self-attention, first introduced in the 2017 paper by Zhouhan Lin et al., is a variant of the attention used in seq2seq that shows, in the form of attention weights, how much each token influenced the classification result.

Structure of self-attention

The idea of a paper is usually told by a single figure. The following figure from the original paper conveys the overall idea and stands for the paper as a whole.

In the end, the data fed to the classifier is the matrix $M$ in the figure. $M$ has size $r \times (2\cdot u)$, where $r$ is the number of viewpoints from which we summarize the sentence and $u$ is the hidden dimension of the LSTM layer. The size is $2\cdot u$ because the LSTM is bidirectional. $M$ is obtained through the following transformation.

$M = A H = \left\{ softmax\left(W_2 \tanh (W_1 H^T)\right)\right\}\cdot [H_{forward}, H_{backward}]$

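This writes the equations of the paper in a single line, and the figure above already explains it well enough. Still, let me walk through it once more.

Embedding has become an almost mandatory component in NLP. The only real difference seems to be whether you use a pre-trained embedding or learn one inside the network for the problem at hand. Let $e$ be the embedding dimension. After the embedding layer, the tokens, now $e$-dimensional vectors, go through two LSTM layers: a forward LSTM that reads the sentence in order and a backward LSTM that reads it in reverse. Each of the two LSTM layers has a $u$-dimensional hidden vector. To summarize the sentence at each time step we concatenate the two hidden vectors, so the information at each time step is summarized in a vector of size $2u$. Stacking all time steps into a matrix gives $H$, whose size is $n \times 2u$, where $n$ is the number of tokens in the sentence.

On a personal note, computer science does not seem to care much about rows versus columns. I do not think I have ever seen a paper or article in which rows and columns match exactly; people apparently just figure it out. I have lost a lot of time trying to line rows and columns up rigorously and finding that they do not agree.

Based on this $H$, we compute an attention vector that decides how much importance (would it be odd to call it attention?) to give to the information summarized in the $2u$-sized vector at each time step.

To get the attention vector, we first apply a linear transformation by multiplying with the matrix $W_1$, which reduces the $2u$-sized hidden state at each time step to size $d_a$. After applying the nonlinearity $\tanh$ to this $d_a$-dimensional vector, we reduce it once more to $r$ dimensions with $W_2$. The resulting $r$-dimensional vector can be read as the importance of the token at that time step from $r$ different viewpoints. Applying softmax to the resulting $r\times n$ matrix along the time-step axis, so that each of the $r$ rows sums to 1, finally gives the $r$-hop self-attention.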
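As a sanity check on the shapes, the transformation $M = softmax(W_2 \tanh(W_1 H^T)) H$ can be sketched in pure Python. This is only an illustration under assumed toy sizes ($n = 3$, $2u = 4$, $d_a = 2$, $r = 2$) with made-up weights; the helper names are hypothetical and this is not the gluon implementation.

```python
import math

def matmul(a, b):
    """Multiply two nested-list matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def softmax(v):
    top = max(v)
    exps = [math.exp(x - top) for x in v]
    total = sum(exps)
    return [x / total for x in exps]

def self_attention(H, W1, W2):
    """H: n x 2u, W1: d_a x 2u, W2: r x d_a. Returns A (r x n) and M (r x 2u)."""
    hidden = [[math.tanh(x) for x in row] for row in matmul(W1, transpose(H))]  # d_a x n
    scores = matmul(W2, hidden)               # r x n
    A = [softmax(row) for row in scores]      # softmax over the n time steps
    M = matmul(A, H)                          # r x 2u
    return A, M

H = [[0.1, 0.2, 0.3, 0.4], [0.5, -0.1, 0.0, 0.2], [-0.3, 0.4, 0.1, 0.0]]
W1 = [[0.2, -0.1, 0.4, 0.3], [0.0, 0.5, -0.2, 0.1]]
W2 = [[1.0, -1.0], [0.5, 0.5]]

A, M = self_attention(H, W1, W2)
print(len(A), len(A[0]), len(M), len(M[0]))
```

Each row of `A` is one viewpoint's weighting over the tokens, and each row of `M` is the corresponding weighted sum of hidden states.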

In the actual implementation…

Computing self-attention introduces two new matrices. $W_1$ and $W_2$ have sizes $d_a \times 2u$ and $r \times d_a$, respectively. They can be seen as the weights of dense layers that take $2u$ and $d_a$ inputs and produce $d_a$ and $r$ outputs, respectively. Neither dense layer should have a bias term.

Seen this way, every time step can be treated independently: $W_1$ and $W_2$ are applied to each time step separately. In the implementation, the $n$ time steps can therefore be treated like separate examples, which leads to the following code.

class Sentence_Representation(nn.Block):
    def __init__(self, **kwargs):
        super(Sentence_Representation, self).__init__()
        for (k, v) in kwargs.items():
            setattr(self, k, v)
        
        self.att = []
        with self.name_scope():
            self.f_hidden = []
            self.b_hidden = []
            self.embed = nn.Embedding(self.vocab_size, self.emb_dim)
            self.drop = nn.Dropout(.2)
            self.f_lstm = rnn.LSTMCell(self.hidden_dim // 2)
            self.b_lstm = rnn.LSTMCell(self.hidden_dim // 2)
            self.w_1 = nn.Dense(self.d, use_bias = False)
            self.w_2 = nn.Dense(self.r, use_bias = False)

    def forward(self, x, _f_hidden, _b_hidden):
        embeds = self.embed(x) # batch * time step * embedding
        f_hidden = []
        b_hidden = []
        self.f_hidden = _f_hidden
        self.b_hidden = _b_hidden
        # Forward LSTM
        for i in range(embeds.shape[1]):
            dat = embeds[:, i, :]
            _, self.f_hidden = self.f_lstm(dat, self.f_hidden)
            f_hidden.append(self.f_hidden[0])
        # Backward LSTM
        for i in reversed(range(embeds.shape[1])):
            dat = embeds[:, i, :]
            _, self.b_hidden = self.b_lstm(dat, self.b_hidden)
            b_hidden.append(self.b_hidden[0])
        b_hidden.reverse() # restore time order so b_hidden[t] aligns with f_hidden[t]
        
        _hidden = [nd.concat(x, y, dim = 1) for x, y in zip(f_hidden, b_hidden)]
        h = nd.concat(*[x.expand_dims(axis = 0) for x in _hidden], dim = 0)
        h = nd.transpose(h, (1, 0, 2)) # Batch * Timestep * (2 * hidden_dim)
        
        # get self-attention
        _h = h.reshape((-1, h.shape[-1]))
        _w = nd.tanh(self.w_1(_h))
        w = self.w_2(_w)
        _att = w.reshape((-1, h.shape[1], w.shape[-1])) # Batch * Timestep * r
        att = nd.softmax(_att, axis = 1)
        self.att = att # store attention 
        x = gemm2(att, h, transpose_a = True)  # h = Batch * Timestep * (2 * hidden_dim), a = Batch * Timestep * r
        return x, att

Above, `h` holds the hidden states produced by the LSTM: $n$ vectors of size $2u$. Before applying the w_1 layer, we reshape so that each time step is treated as one example. So when computing attention, the network does not see batch-size ($B$) examples but $B\times n$ examples as input.

Results

To visualize the attention, we define the following function.

def plot_attention(net, n_samples = 10, mean = False):
    from matplotlib import pyplot as plt
    import seaborn as sns
    sns.set()
    idx = np.random.choice(np.arange(len(va_x)), size = n_samples, replace = False)
    _dat = [va_x[i] for i in idx]
    
    w_idx = []
    word = [[idx2word[x] for x in y] for y in _dat]
    original_txt = [va_origin[i] for i in idx]
    out, att = net(nd.array(_dat, ctx = context)) 

    _a = []
    _w = []
    for x, y, z in zip(word, att, original_txt):
        _idx = [i for i, _x in enumerate(x) if _x != 'PAD'] # compare strings by value, not identity
        _w.append(np.array([x[i] for i in _idx]))
        _a.append(np.array([y[i].asnumpy() for i in _idx]))
        
    _label = [va_y[i] for i in idx]
    _pred = (nd.sigmoid(out) > .5).asnumpy()
    
    fig, axes = plt.subplots(int(np.ceil(n_samples / 4)), 4, sharex = False, sharey = True)
    plt.subplots_adjust(hspace=1)
    if mean == True:
        fig.set_size_inches(20, 4)
        plt.subplots_adjust(hspace=5)
    else:
        fig.set_size_inches(20, 20)
        plt.subplots_adjust(hspace=1)
    cbar_ax = fig.add_axes([.91, .3, .04, .4])
    
    
    
    for i in range(n_samples):
        if mean == True:
            _data = nd.softmax(nd.array(np.mean(_a[i], axis = 1))).asnumpy()
            sns.heatmap(pd.DataFrame(_data, index = _w[i]).T, ax = axes.flat[i], cmap = 'RdYlGn', linewidths = .3, cbar_ax = cbar_ax)
        else:
            sns.heatmap(pd.DataFrame(_a[i], index = _w[i]).T, ax = axes.flat[i], cmap = 'RdYlGn', linewidths = .3, cbar_ax = cbar_ax)
        axes.flat[i].set_title('Label: {}, Pred: {}'.format(_label[i], int(_pred[i])))

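We can plot each of the $r$ attention vectors separately, but the function can also combine the $r$ attention vectors (with `mean = True`, the code averages them) and apply softmax again to get a single new attention vector. I plotted the $r$ vectors, but perhaps because the sentences are so short, they did not reveal much insight. The point of having $r$ attention vectors was to summarize a sentence from several viewpoints when it gets long or complex, which does not seem to apply to this problem.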
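A small sketch of that combination step, assuming the attention of one sentence is stored as one row of $r$ weights per token (the helper name `combine_hops` and the numbers are made up):

```python
import math

def softmax(v):
    top = max(v)
    exps = [math.exp(x - top) for x in v]
    total = sum(exps)
    return [x / total for x in exps]

def combine_hops(att):
    """att: one row per token, each holding r attention weights.
    Average over the r hops, then renormalize over the tokens with softmax."""
    means = [sum(row) / len(row) for row in att]
    return softmax(means)

att = [[0.7, 0.2], [0.2, 0.3], [0.1, 0.5]]   # 3 tokens x r = 2 hops
combined = combine_hops(att)
print(combined)
```

Note that the second softmax flattens the differences between tokens, so the combined vector is smoother than either hop on its own.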

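Anyway, here are the results for 12 randomly chosen samples.

In several results the model classifies a sample as negative when the words hate or suck appear. When it classifies a sample as positive, on the other hand, no such clear feature stands out. In some other runs, words such as awesome and want received high attention when the sample was classified as positive. Would training longer with more samples improve this?

In the next post, I will implement and look at self-attention based on the relation network.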

Sentiment Analysis- Bidirectional LSTM

Thu, 05 Jul 2018 17:00:00 +0900

Getting started

The previous post showed how to represent a sentence with a single LSTM layer. As mentioned there, there are several ways to extract the information needed for sentiment analysis from a sentence. Today we look at using a bidirectional LSTM, in two ways: implementing the bidirectional LSTM by hand with LSTMCell, and getting it simply with the LSTM layer.

So why split a simple task into two and make it harder? It is no exaggeration to say that the attention mechanism drives the recent trend. To use attention, we need the hidden state at every time step, and that is hard to get with the LSTM layer alone. That is when LSTMCell is needed.

This post is less about sentiment analysis itself and more like a tour of gluon's LSTM API.

Architecture

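This time, let us represent a sentence with a bidirectional LSTM as follows.

Googling turns up a structure like the one above. The figure is broadly right, but it contains an error: at the last time step, the backward LSTM is used to classify the sentence before it has read the whole sentence. So the figure is actually wrong. There are more wrong figures than you would expect, and they get in the way of an implementation. Having drawn one myself, I found it is not easy to draw without errors either. This is probably why hands-on experience matters.

I consider the following figure more accurate.

So the right choice is to take the hidden state of the last time step from the forward LSTM and the hidden state of the first time step from the reverse LSTM. Let us implement this in Gluon.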
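The point about which time step to read can be checked with a toy stand-in for a recurrent pass. This is not an LSTM: the "state" is simply the tuple of tokens read so far, and the function name `run` is made up for illustration.

```python
def run(tokens):
    """Toy recurrent pass: the state after each step is the tuple of tokens read so far."""
    state, states = (), []
    for tok in tokens:
        state = state + (tok,)
        states.append(state)
    return states

tokens = ["i", "love", "this", "movie"]
fwd = run(tokens)                 # fwd[t] has read tokens[0..t]
bwd = run(tokens[::-1])[::-1]     # re-indexed by position: bwd[t] has read tokens[t..end]

full_forward = fwd[-1]            # last time step of the forward pass
full_backward = bwd[0]            # first time step of the backward pass
print(len(full_forward), len(full_backward), len(bwd[-1]))
```

At the last position the backward pass has read only one token, which is exactly the problem with figures that pair the two directions at the final time step.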

A bidirectional LSTM is just two independent LSTMs

See Bidirectional LSTM for the full code.

As mentioned, the simplest approach is to run two LSTMs that differ only in direction. The easiest way to implement a bidirectional LSTM is the bidirectional option of the LSTM layer.

self.lstm = rnn.LSTM(HIDDEN_DIM, dropout = dropout \
                               , input_size = EMB_DIM \
                               , bidirectional = True \
                               , layout = 'NTC')

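As a result, the layer returns twice as many hidden states as there are LSTM layers. The returned array has shape (2 $\times$ Batch size $\times$ Hidden dim), so it has to be rearranged into (Batch size $\times$ 2 $\cdot$ Hidden dim) before the bidirectional LSTM result is fed to the classifier.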
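One thing worth checking when rearranging a (2, B, H) array into (B, 2H) is which values end up in the same row. The sketch below uses nested lists with labels instead of numbers; the helper names are hypothetical, and `naive_reshape` imitates a row-major reshape.

```python
def naive_reshape(states, width):
    """Row-major flatten of a (2, B, H) nested list, then cut into rows of length `width`."""
    flat = [x for direction in states for row in direction for x in row]
    return [flat[i:i + width] for i in range(0, len(flat), width)]

def concat_per_sample(states):
    """Pair the forward and backward state of the same sample: (2, B, H) -> (B, 2H)."""
    fwd, bwd = states
    return [f + b for f, b in zip(fwd, bwd)]

# B = 2 samples, H = 2; each label encodes direction and sample index
states = [[["f0a", "f0b"], ["f1a", "f1b"]],   # forward states of samples 0 and 1
          [["b0a", "b0b"], ["b1a", "b1b"]]]   # backward states of samples 0 and 1

naive = naive_reshape(states, width=4)
paired = concat_per_sample(states)
print(naive[0])
print(paired[0])
```

With a batch size above 1, the row-major reshape puts the forward states of two different samples in one row, whereas concatenating the two directions along the feature axis keeps each sample's states together.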

class SA_Classifier(nn.Block):
    def __init__(self, sen_rep, classifier, batch_size, context, **kwargs):
        super(SA_Classifier, self).__init__(**kwargs)
        self.batch_size = batch_size
        self.context = context
        with self.name_scope():
            self.sen_rep = sen_rep
            self.classifier = classifier
            
    def forward(self, x):
        hidden = self.sen_rep.lstm.begin_state(func = mx.nd.zeros, batch_size = self.batch_size, ctx = self.context)
        _, _x = self.sen_rep(x, hidden) # Use the last hidden step
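        # Extract hidden state from _x: _x[0] has shape (2, batch, hidden).
        # Concatenate forward and backward states per sample; a plain reshape to
        # (-1, 2 * hidden) would mix rows of different samples when batch > 1.
        _x = nd.concat(_x[0][0], _x[0][1], dim = 1)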
        x = self.classifier(_x)
        return x  

The line in the forward method that builds `_x` just before the classifier call is the one that does this.

LSTMCell

LSTMCell is a more advanced API that gives finer control over the LSTM. In simple cases the LSTM layer is enough, but for a hand-written bidirectional LSTM the cell has to be applied over the time steps in reverse order. See the following code.

class Sentence_Representation(nn.Block):
    def __init__(self, emb_dim, hidden_dim, vocab_size, dropout = .2, **kwargs):
        super(Sentence_Representation, self).__init__(**kwargs)
        self.vocab_size = vocab_size
        self.emb_dim = emb_dim
        self.hidden_dim = hidden_dim
        with self.name_scope():
            self.f_hidden = []
            self.b_hidden = []
            self.embed = nn.Embedding(self.vocab_size, self.emb_dim)
            self.drop = nn.Dropout(.2)
            self.f_lstm = rnn.LSTMCell(self.hidden_dim // 2)
            self.b_lstm = rnn.LSTMCell(self.hidden_dim // 2)
            
    def forward(self, x, _f_hidden, _b_hidden):
        embeds = self.embed(x) # batch * time step * embedding
        self.f_hidden = _f_hidden
        self.b_hidden = _b_hidden
        # Forward LSTM
        for i in range(embeds.shape[1]):
            dat = embeds[:, i, :]
            _, self.f_hidden = self.f_lstm(dat, self.f_hidden)
        # Backward LSTM
        for i in reversed(range(embeds.shape[1])):
            dat = embeds[:, i, :] # plain Python int index; np.int was removed in NumPy 1.24
            _, self.b_hidden = self.b_lstm(dat, self.b_hidden)
        x = nd.concat(self.f_hidden[0], self.b_hidden[0], dim = 1)
        return x

The result of LSTMCell is similar to that of LSTM, but its shape differs slightly. LSTMCell does not assume multiple layers; if you need to stack several layers, you have to stack them yourself in a loop. So the hidden state of LSTMCell has shape (Batch size $\times$ Hidden dim), whereas LSTM returns a hidden state of shape (Number of layers $\times$ Batch size $\times$ Hidden dim). The previous post needed a step that reshapes the hidden state; here that step is unnecessary, and we simply concatenate the hidden states.

 x = nd.concat(self.f_hidden[0], self.b_hidden[0], dim = 1)

Accessing the hidden state of every time step

So far we only used the final hidden state for classification, but nothing forces us to. We could also use the hidden states of all time steps, with a structure like the one in the following figure.

Here, at every time step we concatenate the forward and backward LSTM hidden states and take their average over time as the sentence representation. This requires the hidden state at every time step, so the LSTMCell has to be stepped once per time step. After each step we store the hidden state, then concatenate the stored matrices and take the mean. The following code implements this.

class Sentence_Representation(nn.Block): ## Using LSTMCell : Average entire time step
    def __init__(self, emb_dim, hidden_dim, vocab_size, dropout = .2, **kwargs):
        super(Sentence_Representation, self).__init__(**kwargs)
        self.vocab_size = vocab_size
        self.emb_dim = emb_dim
        self.hidden_dim = hidden_dim
        with self.name_scope():
            self.f_hidden = []
            self.b_hidden = []
            self.embed = nn.Embedding(self.vocab_size, self.emb_dim)
            self.f_lstm = rnn.LSTMCell(self.hidden_dim // 2)
            self.b_lstm = rnn.LSTMCell(self.hidden_dim // 2)
            
    def forward(self, x, _f_hidden, _b_hidden):
        f_hidden = []
        b_hidden = []
        
        self.f_hidden = _f_hidden
        self.b_hidden = _b_hidden
        
        embeds = self.embed(x) # batch * time step * embedding
        
        # Forward LSTM
        for i in range(embeds.shape[1]):
            dat = embeds[:, i, :]
            _, self.f_hidden = self.f_lstm(dat, self.f_hidden)
            f_hidden.append(self.f_hidden[0])
        
        # Backward LSTM
        for i in reversed(range(embeds.shape[1])):
            dat = embeds[:, i, :] # plain Python int index; np.int was removed in NumPy 1.24
            _, self.b_hidden = self.b_lstm(dat, self.b_hidden)
            b_hidden.append(self.b_hidden[0])

        f_hidden.reverse()
        _hidden = [nd.concat(x, y, dim = 1) for x, y in zip(f_hidden, b_hidden)]
        h = nd.concat(*[x.expand_dims(axis = 0) for x in _hidden], dim = 0)
        h = nd.mean(h, axis = 0)
        return h

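The backward and forward LSTMs run in opposite directions, so one of the two lists has to be reversed before concatenating (the code reverses `f_hidden`, which keeps the pairs aligned and does not affect the mean). Other than that there is nothing tricky. After concatenating, we average the hidden states from all time steps. The stacked hidden states have shape (Time step $\times$ Batch size $\times$ Hidden dimension), so we average over axis 0.

If we only use the output of each time step..

The LSTM output at each time step is $h_t = o_t * \tanh(c_t)$, where $o_t$ is the output gate, $c_t$ is the cell state, and $h_t$ is the hidden state at step $t$. In Gluon, the outputs of every time step are easy to get with BidirectionalCell, written as follows. BidirectionalCell takes two LSTMCells as input, which scan the sentence in the forward and backward directions.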

class Sentence_Representation(nn.Block):
    def __init__(self, **kwargs):
        super(Sentence_Representation, self).__init__()
        for (k, v) in kwargs.items():
            setattr(self, k, v)
        
        with self.name_scope():
            self.embed = nn.Embedding(self.vocab_size, self.emb_dim)
            self.drop = nn.Dropout(.2)
            self.bi_rnn = rnn.BidirectionalCell(
                 rnn.LSTMCell(hidden_size = self.hidden_dim // 2),  #mx.rnn.LSTMCell doesnot work
                 rnn.LSTMCell(hidden_size = self.hidden_dim // 2)
            )
            self.w_1 = nn.Dense(self.d, use_bias = False)
            self.w_2 = nn.Dense(self.r, use_bias = False)

    def forward(self, x, hidden):
        embeds = self.embed(x) # batch * time step * embedding
        h, _ = self.bi_rnn.unroll(length = embeds.shape[1] \
                                       , inputs = embeds \
                                       , layout = 'NTC' \
                                       , merge_outputs = True)
        # For understanding
        batch_size, time_step, _ = h.shape
        # get self-attention
        _h = h.reshape((-1, self.hidden_dim))
        _w = nd.tanh(self.w_1(_h))
        w = self.w_2(_w)
        _att = w.reshape((-1, time_step, self.r)) # Batch * Timestep * r
        att = nd.softmax(_att, axis = 1)
        x = gemm2(att, h, transpose_a = True)  # h = Batch * Timestep * (2 * hidden_dim), a = Batch * Timestep * r
        return x, att

The result is obtained simply with the unroll method. It returns the outputs of every time step as its first result and the final hidden states as its second. Only the first matters to us, so we take just the first return value as above. The code is collected here for reference.

Results

The accuracy is still 0.99. Out of 3,149 reviews, 38 were misclassified. Looking at one of them, even I cannot tell whether it is positive or negative.

Fair enough on that one. Later, though, I should switch to harder datasets and see how different ways of using the LSTM improve or hurt performance.

Sentiment Analysis - LSTM

Thu, 28 Jun 2018 17:00:00 +0900

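Getting started

RNNs are widely used in NLP. They are also used to model time-series-like data in which earlier observations influence later ones, problems that used to be handled with ARIMA, Markov random fields and the like. As mentioned before, representing a sentence numerically means turning each token into numbers and then summarizing those tokens into a single representation. Any method will do as long as it preserves the information in the tokens and the relationships among them. Some of these methods are designed to mimic the way humans perceive a sentence; the representative examples are RNNs, CNNs, and the attention mechanism that has drawn much attention recently. Whereas the earlier approach treats the summary of the embedded words in a sentence as a pattern and simply classifies that pattern, RNNs and CNNs can yield better models by following more closely how humans understand a sentence. This is also why more complex models are needed for neural translation, which goes beyond sentiment analysis.

It is not easy to find material on how to use LSTM in Gluon. The API documentation is not great, and there are hardly any examples. Since the basics of RNNs are covered in many places, this post focuses on how to use LSTM and other RNNs in gluon through a sentiment analysis task.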

Architecture

Many architectures are conceivable. We could use a single hidden layer or several, or average the outputs of every time step. For my own memory, I will note how each structure maps to gluon, and record in detail how to use the gluon LSTM API along the way.

Structure of the LSTM cell

Information about the LSTM structure is available in many places. One fact worth keeping in mind, though, is that an LSTM uses two pieces of information from the previous time step. As the following figure shows, both the cell state and the hidden state from the previous time step are taken as inputs to the next time step.

A plain RNN and a GRU both take only one hidden state from the previous step. This is a confusing point in the implementation: an LSTM needs initial states as input, and they differ from those of an RNN or GRU. For an LSTM we pass two ndarrays as a list: the first ndarray is the initial hidden state and the second is the initial cell state. The documentation only says that a list of length 2 must be given, without stating this explicitly, so you have to read the source to find out. A more detailed explanation would have been very welcome.
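To see why the state is a pair, here is a scalar LSTM step in pure Python. It follows the standard LSTM equations rather than gluon's internals; the function `lstm_step`, the shared toy weights, and the inputs are all made up for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, state, w):
    """One scalar LSTM step. state = [h, c]; w maps a gate name to (w_x, w_h, bias)."""
    h, c = state
    pre = {name: wx * x + wh * h + b for name, (wx, wh, b) in w.items()}
    i, f, o = sigmoid(pre["i"]), sigmoid(pre["f"]), sigmoid(pre["o"])
    g = math.tanh(pre["g"])
    c_new = f * c + i * g             # the cell state carries the long-term memory
    h_new = o * math.tanh(c_new)      # the hidden state is also the per-step output
    return h_new, [h_new, c_new]

weights = {name: (0.5, 0.1, 0.0) for name in ("i", "f", "o", "g")}
state = [0.0, 0.0]                    # [initial hidden state, initial cell state]
for x in [1.0, -0.5, 2.0]:
    out, state = lstm_step(x, state, weights)
print(out, state)
```

A plain RNN or GRU step would carry only `h` forward, which is why their initial state is a single array while the LSTM's is a list of two.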

Using the last hidden state of a single hidden layer

See Single Hidden State for the full code.

The first basic structure embeds the tokens, passes them through an LSTM layer, and uses the result as the input of the classifier.

Since an LSTM takes all past information into account, the very last hidden state can be regarded as a compressed summary of the whole sentence, so we can use only the hidden state of the last time step as the classifier input.

class Sentence_Representation(nn.Block):
    def __init__(self, emb_dim, hidden_dim, vocab_size, dropout = .2, **kwargs):
        super(Sentence_Representation, self).__init__(**kwargs)
        self.vocab_size = vocab_size
        self.emb_dim = emb_dim
        self.hidden_dim = hidden_dim
        with self.name_scope():
            self.hidden = []
            self.embed = nn.Embedding(vocab_size, emb_dim)
            self.lstm = rnn.LSTM(hidden_dim, dropout = dropout \
                               , input_size = emb_dim, layout = 'NTC')

    def forward(self, x, hidden):
        embeds = self.embed(x) # batch * time step * embedding: NTC
        lstm_out, self.hidden = self.lstm(embeds, hidden)
        return lstm_out, self.hidden

The code above builds an LSTM-based sentence representation for a given sentence. To represent the sentence as a single vector, it passes the sentence through the embedding layer and uses the result as the input of the LSTM layer.

The embedding layer is constructed with vocab_size as input, and the actual training data are sequences of token indices.

self.embed = nn.Embedding(vocab_size, emb_dim)

The embedding output from this layer has shape (Batch size $\times$ Sentence length $\times$ Embedding dimension).

As an input example, if the vocabulary has only 5 words, a sentence consists of 3 words, and the maximum sentence length we consider is 7, the input looks like this. $\Rightarrow$ [5, 2, 4, 0, 0, 0, 0]
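A minimal sketch of how such a fixed-length input could be produced. The helper `pad_tokens` is hypothetical, and it assumes index 0 is reserved for padding and that longer sentences are truncated.

```python
def pad_tokens(token_ids, max_len, pad_id=0):
    """Truncate to max_len, then right-pad with pad_id up to max_len."""
    clipped = token_ids[:max_len]
    return clipped + [pad_id] * (max_len - len(clipped))

padded = pad_tokens([5, 2, 4], max_len=7)
print(padded)
```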

The embedded result is used as the input of the lstm. The LSTM layer is declared as follows.

self.lstm = rnn.LSTM(hidden_dim, dropout = dropout \
                               , input_size = emb_dim, layout = 'NTC')

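One of the most important things here is the layout keyword. Writing NTC lets us feed the embedding output directly without transposing it. A transpose is a comparatively expensive operation, so it is better avoided.

After that, we take the hidden state and use it to represent the sentence as a single vector.

The classifier is defined simply as follows.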

classifier = nn.Sequential()
classifier.add(nn.Dense(16, activation = 'relu'))
classifier.add(nn.Dense(8, activation = 'relu'))
classifier.add(nn.Dense(1))

The data here are so easy to classify that a simple classifier like this performs well enough.

Once the two networks are defined, be sure to initialize their parameters as follows.

sen_rep.collect_params().initialize(mx.init.Xavier(), ctx = context)
classifier.collect_params().initialize(mx.init.Xavier(), ctx = context)

The final classifier takes the two networks above as input and is defined as follows.

class SA_Classifier(nn.Block):
    def __init__(self, sen_rep, classifier, batch_size, context, **kwargs):
        super(SA_Classifier, self).__init__(**kwargs)
        self.batch_size = batch_size
        self.context = context
        with self.name_scope():
            self.sen_rep = sen_rep
            self.classifier = classifier
            
    def forward(self, x):
        hidden = self.sen_rep.lstm.begin_state(func = mx.nd.zeros, batch_size = self.batch_size, ctx = self.context)
        _, _x = self.sen_rep(x, hidden) # Use the last hidden step
        # Extract hidden state from _x
        _x = nd.reshape(_x[0], (-1, _x[0].shape[-1]))
        x = self.classifier(_x)
        return x    

The key implementation point here is that hidden is a list of length 2. The LSTM layer does provide a begin_state method of its own, but as you implement more complex models, this difference from other RNNs is easy to trip over.

We do not need the output of each time step, so we ignore the first return value and take only the second, which carries the hidden state. As mentioned, because this is an LSTM, we have to take the first element of _x to get the hidden state, as follows.

 _x = nd.reshape(_x[0], (-1, _x[0].shape[-1]))

For reference, _x has shape (Batch size $\times$ Hidden dim). We feed _x to the classifier and do the final classification with a DNN.

Results

Because the dataset is so easy, we get a high accuracy of 0.9, and within only 5 epochs. I will not include more detailed results; running the code linked above should be enough to follow along. Next, I will briefly explain how to implement the case with 2 LSTM layers in Gluon.

Sentiment Analysis - Relational Network

Thu, 21 Jun 2018 17:00:00 +0900

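Getting started

We keep discussing how to express a sentence in a form a computer can understand. The idea is to split a sentence into minimal units called tokens, whether words or characters, and summarize them into a single sentence representation. Only then can the computer learn from it.

The first method we looked at was BoW. CBoW-style methods improve on BoW with embeddings such as word2vec. In BoW, tokens are represented as one-hot vectors, which are simply summed or averaged and fed into a machine learning model, and this already gives results of reasonable quality. It is known as a very naive but powerful method, at least for text classification.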
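For reference, the BoW summary described here can be sketched in a few lines. The function `bow_vector` and the toy indices are assumptions for illustration; summing one-hot vectors is the same as counting how often each vocabulary index occurs.

```python
def bow_vector(token_ids, vocab_size, average=False):
    """Sum (or average) the one-hot vectors of the tokens in a sentence."""
    counts = [0.0] * vocab_size
    for idx in token_ids:
        counts[idx] += 1.0        # adding a one-hot vector increments a single slot
    if average and token_ids:
        counts = [c / len(token_ids) for c in counts]
    return counts

summed = bow_vector([0, 2, 2], vocab_size=4)
averaged = bow_vector([0, 2, 2], vocab_size=4, average=True)
print(summed, averaged)
```

The word order is lost entirely, which is the limitation the later representations try to address.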

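Such a model can hardly be said to understand the sentence itself; it is more accurate to say it classifies patterns by the frequency with which words appear. Attempts to capture the meaning of a sentence or to mimic how people understand a sentence are made with RNN- and CNN-based sentence representations, which we have already covered. Here we go one step further and introduce the Relation Network, which represents a sentence by finding the relations between the words in it.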

Relation network

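As far as I know, the name relation network comes from the 2017 paper by Santoro, while the concept seems to have been used earlier under the name skip-gram. (Please correct me if I am wrong.)

Suppose we want to represent the $i$-th token $t_i$ of a sentence $S$ that contains $T$ words. In an RN, for every word in the sentence ($T-1$ words if we exclude the token itself), the relation with that word is captured in a vector. The relation between two words is expressed as follows.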

$f(t_i, t_j) = W\phi(U_{left}e_i + U_{right}e_j )$

Here $U_{left}$, $U_{right}$ and $W$ are the parameters to be learned, $\phi$ is a non-linear transformation, and $e_i$, $e_j$ are the embeddings of the $i$-th and $j$-th tokens.

If we train on data containing $N$ sentences with $T$ tokens each, we effectively train on $N \times \frac{T(T-1)}{2}$ pairs. Once $U_{left}$, $U_{right}$ and $W$ have been learned, the sentence representation is given by

$RN(S) = \frac 2 {T(T-1)}\sum_{i=1}^{T-1}\sum_{j = i +1}^T f(t_i, t_j)$

That is, we compute the relation for every pairwise combination of words ($\frac{T(T-1)}{2}$ of them) and average the results.
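The averaging over pairs can be sketched as follows. The `relation` function here is only a stand-in for $f(t_i, t_j)$ (an elementwise product, an assumption chosen to keep the example short, not the learned $W\phi(\cdot)$), and the embeddings are made up.

```python
from itertools import combinations

def relation(e_i, e_j):
    """Stand-in for f(t_i, t_j): elementwise product of two embeddings."""
    return [a * b for a, b in zip(e_i, e_j)]

def rn_sentence(embeddings):
    """Average the relation vectors over all pairs i < j."""
    pairs = list(combinations(range(len(embeddings)), 2))
    total = [0.0] * len(embeddings[0])
    for i, j in pairs:
        for k, v in enumerate(relation(embeddings[i], embeddings[j])):
            total[k] += v
    return [v / len(pairs) for v in total], len(pairs)

embeddings = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0], [2.0, 2.0]]   # T = 4 tokens
rep, n_pairs = rn_sentence(embeddings)
print(n_pairs, rep)
```

With $T = 4$ there are $4 \cdot 3 / 2 = 6$ pairs, and the representation has the same dimension as a single relation vector regardless of the sentence length.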

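A sentence is thus represented by a single vector; call its size $d$ (the number of rows of $W$).

This method represents the relations between all words uniformly. That is what sets it apart from the CNN-based representation seen earlier and from the self-attention covered later; self-attention can be seen as a compromise between the two extremes, CNN and relation network.

The key trick of the algorithm

It is no exaggeration to say that the heart of the implementation is reshape. All combinations of token embeddings have to be expressed while keeping the batch structure, so the matrix operations need a lot of care.

First, the function $f(t_i, t_j)$ above can be written as follows.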

$f(t_i, t_j) = W\phi \left\{\left(U_{left}, U_{right}\right) \cdot \left(\begin{array}{c} e_i \\ e_j \end{array} \right) \right\}$

Instead of a sum of two separate products, we stack $e_i$ and $e_j$ and treat one large matrix $U = (U_{left}, U_{right})$ as the collection of weights. The form differs from the expression above, although the block product gives the same value, and it makes the computation simpler.

With this form, training proceeds as follows.

First, when a batch holds $b$ observations, the input is no longer $b$ sentence-shaped examples. We consider all combinations of the $T$ tokens in a sentence, and each combination carries some information about the output. We therefore have $b \times T^2$ feature-answer pairs, and this enlarged dataset is passed through the network of $U$ and $W$ and trained by backpropagation. So for each sentence we have to turn all $T^2$ combinations of the $i$-th and $j$-th tokens, $e_i$ and $e_j$, into independent input vectors, which takes a few tricks.

A sentence is basically represented as a $T\times N$ matrix. (In the figure, read $n = T$ and $m = N$.) Here $T$ is the number of tokens in a sentence and $N$ is the dimension of a token, which may be the number of distinct meaningful words in the data or simply the number of characters. A batch contains as many sentences as the mini-batch size. Pictured as a figure, this is three $T\times N$ rectangles side by side, so at this point we have an input of shape $3 \times T \times N$.

Here the trick begins. We add one more dimension at axis 1 to get $3 \times 1 \times T \times N$, then copy $T$ times along axis 1, giving $3 \times T \times T \times N$ with the copies stacked horizontally as in the last row of the figure above. Next, as in the figure below, we build an array of the same size in which the same input is copied vertically.

Concatenating the two arrays along the fourth dimension gives the following array per mini-batch.

Arranged this way, for one example, the vector of length $2N$ at position ($i$, $j$) holds the embedding of the $i$-th token in its first $N$ elements ($m$ in the figure) and the embedding of the $j$-th token in its last $N$ elements. Unrolling this into $T^2$ independent examples gives $3 \times T^2$ vectors of length $2N$, which are used as the model input. This is the core of the relation network. Let us now confirm it in code.
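Before turning to the gluon code, the copy-and-concatenate trick can be traced on a single sentence with nested lists. The function `all_pairs` is a made-up illustration that mirrors an expand-and-repeat along two different axes; which token lands in the first half of each cell depends on which copy is concatenated first.

```python
def all_pairs(sentence):
    """sentence: T x N nested list. Returns a T x T grid whose cells have length 2N."""
    T = len(sentence)
    x_i = [[sentence[c] for c in range(T)] for _ in range(T)]   # new axis 1, repeated T times
    x_j = [[sentence[a] for _ in range(T)] for a in range(T)]   # new axis 2, repeated T times
    return [[x_i[a][c] + x_j[a][c] for c in range(T)] for a in range(T)]

sentence = [[1, 0], [0, 1], [1, 1]]              # T = 3 tokens, N = 2
grid = all_pairs(sentence)
flat = [cell for row in grid for cell in row]    # T^2 independent inputs of length 2N
print(len(flat), grid[0][1])
```

Here cell `[a][c]` holds token `c` followed by token `a`, and flattening the grid gives the $T^2$ rows that are fed to the dense layers.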

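Coding with Gluon

First we need to load the modules. For everything up to loading the data and building the Data Iterator, see the earlier post on text classification with BoW. Here we look at how the trick shown above is actually implemented.

The full code is shown separately below; let us start with the part that implements what was described above.

def hybrid_forward(self, F, x):
    # build the (x_i, x_j) pairs
    # assuming a batch of 64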
    x_i = x.expand_dims(1) # 64 * 1* 40 * 2000
    x_i = F.repeat(x_i,repeats= self.SENTENCE_LENGTH, axis=1) # 64 * 40 * 40 * 2000
    x_j = x.expand_dims(2) # 64 * 40 * 1 * 2000
    x_j = F.repeat(x_j,repeats= self.SENTENCE_LENGTH, axis=2) # 64 * 40 * 40 * 2000
    x_full = F.concat(x_i,x_j,dim=3) # 64 * 40 * 40 * 4000

This is the part that manipulates the $i$-th and $j$-th embeddings. The comments give the size of each layer, computed assuming a batch size of 64. It is also important to keep an eye on memory usage, since this relation network is a very large network.
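A rough way to estimate the footprint of these intermediate arrays is to multiply out the shapes. The helper `tensor_bytes` is hypothetical, and the sketch assumes 4 bytes per element (float32); with 8-byte elements every figure doubles, so treat the numbers as order-of-magnitude estimates only.

```python
def tensor_bytes(shape, bytes_per_element=4):
    """Number of bytes needed to store a dense array of the given shape."""
    n = 1
    for d in shape:
        n *= d
    return n * bytes_per_element

batch, sent_len, vocab = 64, 40, 2000
x = tensor_bytes((batch, sent_len, vocab))
x_i = tensor_bytes((batch, sent_len, sent_len, vocab))
x_full = tensor_bytes((batch, sent_len, sent_len, 2 * vocab))

for name, size in [("x", x), ("x_i", x_i), ("x_full", x_full)]:
    print(f"{name}: {size / 1e9:.2f} GB")
```

The repeated copies grow with the square of the sentence length, so halving the sentence length or the batch size is the quickest way to fit the model into memory.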

    # feed the network 2*VOCABULARY-sized inputs, batch*sentence_length*sentence_length of them
    _x = x_full.reshape((-1, 2 * self.VOCABULARY))
    _x = self.g_fc1(_x) # (64 * 40 * 40) * 256
    _x = self.g_fc2(_x) # (64 * 40 * 40) * 256
    _x = self.g_fc3(_x) # (64 * 40 * 40) * 256
    _x = self.g_fc4(_x) # (64 * 40 * 40) * 256

This part builds $b\times T^2$ vectors of size $2 \times Vocab$ and passes them through 4 layers. The resulting hidden vector, which compresses the relation between two words, has size 256. After that, it is time to define the classifier.

    # sum all sentence_length*sentence_length results to get the sentence representation
    x_g = _x.reshape((-1, self.SENTENCE_LENGTH * self.SENTENCE_LENGTH,256)) # (64, 40*40, 256) : .1GB
    sentence_rep = x_g.sum(1) # (64, 256): ignorable

Once the weights are fixed, summing the feature maps over all token combinations gives a $d$-dimensional vector that accounts for the relations between all tokens. What follows is straightforward.

The full classifier code is below.

class RN_Classifier(nn.HybridBlock):
    def __init__(self, SENTENCE_LENGTH, VOCABULARY, **kwargs):
        super(RN_Classifier, self).__init__(**kwargs)
        self.SENTENCE_LENGTH = SENTENCE_LENGTH
        self.VOCABULARY = VOCABULARY
        with self.name_scope():
            self.g_fc1 = nn.Dense(256,activation='relu')
            self.g_fc2 = nn.Dense(256,activation='relu')
            self.g_fc3 = nn.Dense(256,activation='relu')
            self.g_fc4 = nn.Dense(256,activation='relu')

            self.fc1 = nn.Dense(128, activation = 'relu') # 256 * 128
            self.fc2 = nn.Dense(2) # 128 * 2
            # 1253632 param : about 20MB
    def hybrid_forward(self, F, x):
        # build the (x_i, x_j) pairs
        # assuming a batch of 64
        x_i = x.expand_dims(1) # 64 * 1 * 40 * 2000: 0.02GB
        x_i = F.repeat(x_i,repeats= self.SENTENCE_LENGTH, axis=1) # 64 * 40 * 40 * 2000: 1.52GB
        x_j = x.expand_dims(2) # 64 * 40 * 1 * 2000
        x_j = F.repeat(x_j,repeats= self.SENTENCE_LENGTH, axis=2) # 64 * 40 * 40 * 2000: 1.52GB
        x_full = F.concat(x_i,x_j,dim=3) # 64 * 40 * 40 * 4000: 3.04GB

        # batch*sentence_length*sentence_length개의 batch를 가진 2*VOCABULARY input을 network에 feed
        _x = x_full.reshape((-1, 2 * self.VOCABULARY))
        _x = self.g_fc1(_x) # (64 * 40 * 40) * 256: .1GB 추가메모리는 안먹나?
        _x = self.g_fc2(_x) # (64 * 40 * 40) * 256: .1GB (reuse)
        _x = self.g_fc3(_x) # (64 * 40 * 40) * 256: .1GB (reuse)
        _x = self.g_fc4(_x) # (64 * 40 * 40) * 256: .1GB (reuse)

        # sentence_length*sentence_length개의 결과값을 모두 합해서 sentence representation으로 나타냄
        x_g = _x.reshape((-1, self.SENTENCE_LENGTH * self.SENTENCE_LENGTH,256)) # (64, 40*40, 256) : .1GB
        sentence_rep = x_g.sum(1) # (64, 256): ignorable

        # 여기서부터는 classifier
        clf = self.fc1(sentence_rep)
        clf = self.fc2(clf)
        return clf

Sentiment Analysis - MLP

Sun, 17 Jun 2018 17:00:00 +0900

Introduction

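Text classification is one of the easiest and most basic tasks in NLP. The most basic approach finds patterns of words in a sentence and decides which category the sentence belongs to; if we could classify a sentence by actually understanding the information in it, the results would be even better. That progression is essentially how natural language processing has advanced. In this blog series I plan to write up NLP techniques, which are evolving very quickly. I am no expert either, so I will write at a beginner's level (which is to say, my own level). The goal is to start from text classification in its various forms and techniques and eventually work up to Neural Machine Translation (NMT).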

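I started writing this post after attending Professor Kyunghyun Cho's invited lecture at the Connect Foundation at the end of June, which made me feel it was time to organize things for myself. As always, sitting through a cleanly organized class showed me that nine times out of ten, the things I thought I understood superficially I had not really understood. So I have reorganized the material once more in my own flow.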

This post covers the most basic approach: text classification using Bag of Words (BoW).

Prologue: toward the attention mechanism

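The BoW-based text classification introduced in this post cannot be said to understand a sentence in order to find its category. It simply classifies based on the pattern of tokens that appear in the sentence, which makes it a very elementary technique. As deep learning developed, however, models began to capture more of the information in a sentence than simple word frequencies or patterns. The first neural network to attract attention in NLP was the Recurrent Neural Network (RNN). Convolutional Neural Networks (CNNs), meanwhile, were doing great work in image processing, and the convolution technique used to filter local features of an image began to be applied to capturing local features and context in sentences. Recently, many CNN-based methods have appeared as well. Still, it was not easy to see in what form these models condense information, and handling very long sentences remained a problem. Against this background, the attention mechanism has recently drawn a great deal of interest.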

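The attention mechanism is one of the most actively used architectures in NLP and in the many areas that derive from it, and it is being adopted by many models at a rapid pace. It is now regarded as a very important component not only for model performance but also for interpretability. After attending Professor Kyunghyun Cho's class yesterday in particular, I came away thinking that using attention is no longer optional but essential.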

Since this is my first NLP post, I will first summarize the general ideas behind token representation and sentence representation, the most basic concepts in NLP, and then move on to today's topic, text classification with BoW.

NOTE: The analysis below is laid out in detail in Text_Classification. Please refer to it.

Sentence representation in general

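To analyze a sentence, we first have to turn it into numbers a computer can work with. This can be split into two steps: representing the elements of a sentence as numbers (token representation), where an element may be a word or a character and is called a token, and summarizing the numeric tokens into a numeric representation of the whole sentence (sentence representation).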

Token representation

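The most intuitive way to represent tokens as numbers is to assign each word an index. Given the sentence ‘너의 목소리가 들려’ ("I hear your voice"), ‘너의’ gets 1, ‘목소리가’ gets 2, ‘들려’ gets 3, and so on. This is a typical categorical variable coded as numbers, and when analyzing such data the indexed tokens are expressed as one-hot vectors. In statistics these are called indicator variables. To represent words better, embeddings such as word2vec are also used. I plan to cover that in a separate post on token representation. (I am working hard on it. There is so much to do.)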

As mentioned above, a token can be defined in various ways. The most common token units are words and characters.

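Representing words as one-hot vectors means collecting every word that appears at least once in the dataset and treating each word as one dimension of the vector. The drawback is that for a dataset containing many distinct words, the dimension of the one-hot vector becomes very large, typically thousands to tens of thousands. Korean in particular has a huge number of inflected word forms, so the space we have to deal with becomes very large. Finding the stem or root of each word mitigates this to some extent. It is less a solution than simply the way the problem is usually handled.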

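On top of that, spacing in Korean can vary from writer to writer. The same name may be written ‘스타벅스’ or ‘스타 벅스’, and the two should be recognized as one word, but one-hot vectors cannot account for this. Other languages are said to suffer less from this problem than Korean. Simple typos also inflate the dimension considerably, since a misspelled word is treated as a new word.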

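However, a word-level token representation obtained this way should be easy to understand cognitively. ‘대학교’ (university), ‘중학교’ (middle school), and ‘고등학교’ (high school) would sit very close together in the embedding space, and a person has no trouble recognizing their similarity.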

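In contrast, character-level representations can be expressed in a relatively small number of dimensions. Korean still has a large number of distinct characters, because combinations of consonants and vowels can express a great variety of syllables. English, on the other hand, needs only the alphabet and a few punctuation marks, so a one-hot vector for characters takes only a few dozen dimensions. Processing characters can therefore be much more efficient than processing words. But with a character-level representation a natural question arises: can similarities between words still be expressed well in the representation space? Is the ‘대’ in ‘대학교’ the same ‘대’ as in ‘대구’, or a different one? A nearby one, or a distant one? Nevertheless, studies so far reportedly show that it works very well even on translation, the hardest task. (I do not know the details myself, but there are reportedly results showing that model performance barely degrades. I am curious about the exact results too.) One explanation offered for the good performance is that in a high-dimensional data space, any data point has plenty of neighbors around it.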

Sentence representation

Next comes the sentence representation step, which expresses a sentence based on these numeric token representations. There are roughly five kinds of sentence representation:

CBoW (including BoW)
CNN
Relation Network (Skip-gram)
Self-attention
RNN

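Of course there are many other variants; this is simply what I know of. Still, most approaches do not seem to stray far from these. Today I will cover only BoW. Depending on whether tokens are represented as one-hot vectors or as embeddings, the method is called BoW or CBoW, respectively.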

Sentiment analysis with BoW

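Sentiment analysis predicts the sentiment expressed in a sentence and is the most representative text classification problem. Here, although it is a bit stale, I apply it to the movie review data provided as umich-sentiment-train.txt, where the task is to separate negative from positive responses. For reference, the problem is so easy that accuracy comes out close to 1. Gradually applying the methods to harder datasets is also one of the goals of this blog.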

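Since every token representation has the same dimension, the representation vectors can be summed or averaged element-wise. The resulting single vector is used as the representative value of the sentence. Word order is, of course, not taken into account at all. Even though it ignores order, CBoW works very well for text classification tasks such as sentiment analysis. It would not be suitable, however, for machine translation or sentence generation, where word order matters.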

Mathematically it can be written very simply. The number of tokens differs from sentence to sentence, but if a sentence contains $T$ tokens, we write the tokens as $(e_1, \ldots, e_T)$. The sentence is then represented as

$\frac 1 T \sum_{t=1}^T e_t$

or

$\sum_{t=1}^T e_t$

Easy, right? That is all there is to it. All that remains is to classify the sentences, now expressed as numbers.
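As a small worked instance of the two formulas, the sketch below computes the sum and the mean for a toy sentence, once with one-hot tokens (BoW) and once with dense embeddings (CBoW). The vocabulary size, the token indices, and the random embedding table `E` are made-up values for illustration; in practice the embeddings would be learned.

```python
import numpy as np

vocab_size, emb_dim = 6, 3
tokens = [2, 4, 2, 5]                       # a toy sentence as token indices, T = 4

# BoW: tokens as one-hot vectors; the sum is a vector of word counts
one_hot = np.eye(vocab_size)[tokens]        # (T, vocab_size)
bow_sum = one_hot.sum(axis=0)
bow_mean = one_hot.mean(axis=0)

# CBoW: tokens as dense embeddings (random here, learned in practice), averaged
rng = np.random.default_rng(0)
E = rng.normal(size=(vocab_size, emb_dim))  # hypothetical embedding table
cbow_mean = E[tokens].mean(axis=0)          # (emb_dim,)
print(bow_sum, cbow_mean.shape)
```

Note that averaging embeddings is the same as multiplying the averaged one-hot vector by the embedding table, which is why the two variants are usually discussed together.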

Implementation in Python

What does this look like in Python? First, import the following modules.

import os
import pandas as pd
import numpy as np
import nltk
import collections

The nltk module on the fourth line is the most widely used preprocessing module for NLP; in this post it is used to tokenize sentences into words. The collections module on the fifth line makes it easier to count the elements in a container, and offers more functionality than Python's built-in container types. Now let's load the data.

word_freq = collections.Counter()
max_len = 0
num_rec = 0

with open('../data/umich-sentiment-train.txt', 'rb') as f:
    for line in f:
        label, sentence = line.decode('utf8').strip().split('\t')
        words = nltk.word_tokenize(sentence.lower())
        if len(words) > max_len:
            max_len = len(words)
        for word in words:
            word_freq[word] += 1
        num_rec += 1

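The Counter class in collections counts how many times each value occurs in a container. We open the data file and read it line by line, splitting each line into a label and a sentence. From these we compute the maximum sentence length and count how often each word appears across the whole dataset.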
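To see what the counting step produces, here is a minimal sketch on three made-up sentences. It uses a plain whitespace split instead of `nltk.word_tokenize` so that it runs without downloading tokenizer data; that substitution is an assumption for illustration only.

```python
import collections

sentences = ["the movie was great", "the movie was awful", "great acting"]  # toy data

word_freq = collections.Counter()
max_len = 0
for sentence in sentences:
    words = sentence.lower().split()   # whitespace split stands in for nltk.word_tokenize
    max_len = max(max_len, len(words))
    word_freq.update(words)            # same effect as incrementing each word by 1

top2 = word_freq.most_common(2)        # list of (word, count) pairs
print(top2, max_len, len(word_freq))
```

`most_common(n)` returns a list of `(word, count)` tuples, which is the form the vocabulary-building step later in the post relies on.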

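This analysis considers at most 2000 words. Using the information stored in word_freq, we keep the most frequent words in the data, which together with the two special tokens PAD and UNK gives a vocabulary of 2000 entries. The number of words per sentence is limited to 40: if a sentence has more than 40 words, only the first 40 are used, and if it is shorter, it is padded with 0 up to 40 words. A word that is not in the vocabulary is replaced by 1, marking it as unknown (UNK).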

MAX_FEATURES = 2000
MAX_SENTENCE_LENGTH = 40
# most_common output -> list
word2idx = {x[0]: i+2 for i, x in enumerate(word_freq.most_common(MAX_FEATURES - 2))}
word2idx['PAD'] = 0
word2idx['UNK'] = 1

From the {word: index} mapping we also build the {index: word} mapping. It is useful in many ways later, for example to turn numeric data back into natural language.

idx2word= {i:v for v, i in word2idx.items()}
vocab_size = len(word2idx)

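The next code loads the actual data, using the valid vocabulary, the number of words per sentence, and the handling of out-of-vocabulary words defined above. The numeric data is stored in x and y, and the original sentences are stored in a variable called origin_txt.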

y = []
x = []
origin_txt = []
with open('../data/umich-sentiment-train.txt', 'rb') as f:
    for line in f:
        _label, _sentence = line.decode('utf8').strip().split('\t')
        origin_txt.append(_sentence)
        y.append(int(_label))
        words = nltk.word_tokenize(_sentence.lower())
        _seq = []
        for word in words:
            if word in word2idx.keys():
                _seq.append(word2idx[word])
            else:
                _seq.append(word2idx['UNK'])
        if len(_seq) < MAX_SENTENCE_LENGTH:
            _seq.extend([0] * ((MAX_SENTENCE_LENGTH) - len(_seq)))
        else:
            _seq = _seq[:MAX_SENTENCE_LENGTH]
        x.append(_seq)
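The index lookup, the UNK fallback, and the padding or truncation can also be folded into one small function. The sketch below is an illustrative rewrite, not the post's code: `encode` and `toy_word2idx` are hypothetical names, and the five-entry vocabulary is made up.

```python
def encode(words, word2idx, max_len):
    """Map tokens to indices (UNK for unseen words), then pad or truncate to max_len."""
    seq = [word2idx.get(w, word2idx['UNK']) for w in words]
    seq = seq[:max_len]
    return seq + [word2idx['PAD']] * (max_len - len(seq))

toy_word2idx = {'PAD': 0, 'UNK': 1, 'the': 2, 'movie': 3, 'great': 4}   # toy vocabulary
short = encode(['the', 'movie', 'rocks'], toy_word2idx, max_len=5)
long_ = encode(['the'] * 7, toy_word2idx, max_len=5)
print(short, long_)
```

The short sentence ends with padding zeros and the unseen word 'rocks' becomes 1, while the long sentence is cut to its first five tokens.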

Looking at how often negative and positive labels occur in the data, there are 3091 negatives and 3995 positives.

pd.DataFrame(y, columns = ['yn']).reset_index().groupby('yn').count().reset_index()

To turn the word index vectors obtained this way into one-hot vectors, we define the following function.

def one_hot(x, vocab_size):
    res = np.zeros(shape = (vocab_size))
    res[x] = 1
    return res

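Running the following with the function defined above builds, for each sentence, 40 one-hot vectors of dimension 2000 and sums them over the 40 positions, so every sentence ends up as a single 2000-dimensional count vector.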

x_1 = np.array([np.sum(np.array([one_hot(word, MAX_FEATURES) for word in example]), axis = 0) for example in x])
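The same count vectors can be produced without materialising the intermediate one-hot matrices. The sketch below compares the two on toy data; using `np.bincount` here is my suggestion, not something the post does, and the toy vocabulary size and sequences are made up.

```python
import numpy as np

def one_hot(x, vocab_size):
    res = np.zeros(shape=(vocab_size))
    res[x] = 1
    return res

vocab_size = 6
x = [[2, 4, 2, 0], [5, 1, 0, 0]]   # two toy index sequences; 0 is PAD

# The post's approach: stack one-hot vectors, then sum over token positions
slow = np.array([np.sum(np.array([one_hot(w, vocab_size) for w in ex]), axis=0) for ex in x])

# Equivalent counts computed directly from the indices
fast = np.array([np.bincount(ex, minlength=vocab_size) for ex in x])
print(fast)
```

Both versions count the PAD index as well, so column 0 records how much padding each sentence received.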

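One of the most important concepts in machine learning is the bias-variance trade-off. To handle it well, the data is split into a training set and a validation set: the model is trained on the training set, and the result is evaluated on the validation set to gauge whether it generalizes. The idea was already used in traditional statistical methodology to account for generalization, but it has become far more important in machine learning. Statistical models tend to be simple and less expressive, so over-fitting, where only the training set is fit well, is less of an issue, whereas machine learning models suffer from over-fitting much more severely. That is why splitting the data into a training set and a validation set, and further into a test set for three datasets in total, is such an important step.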

The following splits the processed data into a training set and a validation set.

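# replace = False samples without replacement, so no example is drawn twice for training
tr_idx = np.random.choice(range(x_1.shape[0]), int(x_1.shape[0] * .8), replace = False)
va_idx = [x for x in range(x_1.shape[0]) if x not in tr_idx]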

tr_x = x_1[tr_idx, :]
tr_y = [y[i] for i in tr_idx]
va_x = x_1[va_idx, :]
va_y = [y[i] for i in va_idx]
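A shuffled permutation is another way to obtain two disjoint index sets that together cover every example. The sketch below is a minimal illustration on ten toy indices; the seed and the size are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10                                   # toy number of examples
perm = rng.permutation(n)                # every index exactly once, in random order
n_train = int(n * .8)
tr_idx, va_idx = perm[:n_train], perm[n_train:]
print(sorted(tr_idx), sorted(va_idx))
```

Because each index appears once in the permutation, no example can end up in both sets, which keeps the validation score an honest estimate of generalization.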

At this point all the data for classification is ready. Now it is time to solve the classification problem with a classifier.

Classification

Let's apply a few classifiers. To apply xgboost first, the following libraries are needed.

from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score

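The last line imports the module needed to compute the accuracy score for evaluation. The code for training an xgboost model is extremely simple: declare the class as below and call the fit method. There are of course many parameters, but here we go with the defaults. It works very well.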

xgb = XGBClassifier()
xgb.fit(tr_x, tr_y)

Next, we feed the validation set to the fitted model to get predictions and compute their accuracy.

y_pred_xgb = xgb.predict(va_x)
pred_xgb = [round(val) for val in y_pred_xgb]

accuracy_xgb = accuracy_score(va_y, pred_xgb)
print('Accuracy: %.2f%%'%(accuracy_xgb * 100.0))

Running the above shows a validation accuracy close to 0.98. So it really is a hopelessly easy problem.

Now let's apply a DNN with a similar scheme. Among the commonly used frameworks there is mxnet, a latecomer that started later than the others but, having started late, was built to combine the strengths of the actively used deep learning frameworks TensorFlow, PyTorch, and Keras. And gluon, which wraps mxnet to make it easier to use, seems to me to satisfy both requirements of a deep learning framework: being easy and being flexible. Keras is easy but seems somewhat lacking for expressing increasingly complex deep learning networks, whereas gluon is, in my personal opinion, as easy as Keras and flexible as well. To use mxnet and gluon, import the required modules as follows.

import mxnet as mx
from mxnet import gluon, autograd, nd
from mxnet.gluon import nn
context = mx.gpu()

In TensorFlow, moving between GPU and CPU was not defined in a very user-friendly way. Depending on the device, the graph could get tangled, and although it is hard to put into words, it was quite a nuisance. In gluon, by contrast, specifying a context means you do not have to think much about which resource is being used: set mx.gpu(0) to use the GPU and mx.cpu(0) to use the CPU.

Gluon's basic programming style follows PyTorch. You define the network by subclassing nn.Block as shown below. The layers that hold the network's weights are defined inside name_scope, and the actual feed-forward computation happens in the forward method. Stacking a network inside a class resembles PyTorch, but a network can also be defined in a Keras-like style. Examples that define a network with nn.Sequential will probably follow in later posts.

class MLP(nn.Block):
    def __init__(self, input_dim, emb_dim, **kwargs):
        super(MLP, self).__init__(**kwargs)
        with self.name_scope():
            self.embed = nn.Embedding(input_dim = input_dim, output_dim = emb_dim)
            self.dense1 = nn.Dense(64)
            #self.dense2 = nn.Dense(32, activation = 'relu')
            self.bn = nn.BatchNorm()
            self.dense2 = nn.Dense(2)

    def forward(self, x):
        x = self.embed(x)
        x = self.dense1(x)
        x = self.bn(x)
        x = nd.relu(x)
        x = self.dense2(x)
        return x

The code above defines a simple network with one hidden dense layer, using batch normalization and the ReLU activation. The final output has two nodes, and softmax will turn these two values into the probabilities of positive and negative. Next we instantiate the class and initialize the parameters, that is, the weights, of the resulting mlp object, using the Xavier initializer. The context is also specified so that the GPU is used.

mlp = MLP(input_dim = MAX_FEATURES, emb_dim = 50)
mlp.collect_params().initialize(mx.init.Xavier(), ctx = context)

Training requires choosing a loss function and an optimizer. In gluon these can be defined simply as follows.

loss = gluon.loss.SoftmaxCELoss()
trainer = gluon.Trainer(mlp.collect_params(), 'adam', {'learning_rate': 1e-3})

SoftmaxCELoss means applying softmax and then the cross entropy loss. Because softmax is applied here, no activation function was specified on the final output layer when the network was defined. The trainer takes the parameters to be trained, the type of optimizer, and the hyperparameters the optimizer needs. Everything model-related that training needs is now defined. To summarize, running a DNN requires deciding at least the following four things.

Network architecture
Optimizer
Loss Function
Hyper parameter
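To make the "softmax, then cross entropy" description concrete, here is a NumPy sketch of the per-example loss on two toy logit rows. It is a plain re-derivation for illustration, not Gluon's implementation; `softmax_ce` is a hypothetical helper name.

```python
import numpy as np

def softmax_ce(logits, labels):
    """Per-example cross entropy of softmax(logits) against integer labels."""
    shifted = logits - logits.max(axis=1, keepdims=True)     # subtract the row max for numerical stability
    log_prob = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(len(labels)), labels]

logits = np.array([[0.0, 0.0], [2.0, -1.0]])   # toy outputs of a 2-node output layer
labels = np.array([1, 0])
losses = softmax_ce(logits, labels)
print(losses)
```

Equal logits give a loss of $\log 2$, the value for a coin-flip prediction, while a confident correct prediction gives a loss close to 0. One value is returned per example, which is why the training code sums the loss over the batch.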

When feeding data to the network, deep learning uses mini-batches. This started as a practical consideration because large datasets do not fit in memory, but it is known that updating the parameters with only part of the data still works well on average; updating the parameters using a randomly drawn subset of the data is called SGD. Optimizers such as Adam and Adadelta are variants of SGD. The data therefore has to be cut into mini-batch-sized pieces and fed to the network, and to make this easy mxnet provides a class called NDArrayIter. It is used as follows.

batch_size = 64 # assumed value; the original post does not show this setting
train_data = mx.io.NDArrayIter(data=[tr_x, tr_y], batch_size=batch_size, shuffle = False)
valid_data = mx.io.NDArrayIter(data=[va_x, va_y], batch_size=batch_size, shuffle = False)

Since the data is wrapped in an iterator that hands over one mini-batch at a time, memory becomes much less of a concern.

The model-related and data-related preparations are now all done. With these settings in place, we can run the actual training. The code is as follows.

from tqdm import tqdm, tqdm_notebook

n_epoch = 10 # assumed value; the original post does not show this setting

for epoch in tqdm_notebook(range(n_epoch), desc = 'epoch'):
    ## Training
    train_data.reset()
    n_obs = 0
    _total_los = 0
    pred = []
    label = []
    for i, batch in enumerate(train_data):
        _dat = batch.data[0].as_in_context(context)
        _label = batch.data[1].as_in_context(context)
        with autograd.record():
            _out = mlp(_dat)
            _los = nd.sum(loss(_out, _label)) # loss returns one value per example in the batch
        _los.backward()
        trainer.step(_dat.shape[0])
        n_obs += _dat.shape[0]
        _total_los += _los.asscalar()
        # Keep accumulating outputs to compute the epoch-level metrics
        pred.extend(nd.softmax(_out)[:,1].asnumpy()) # the second column is the predicted probability
        label.extend(_label.asnumpy())
    tr_acc = accuracy_score(label, [round(p) for p in pred])
    tr_loss = _total_los/n_obs

    ### Evaluate on the validation set
    valid_data.reset()
    n_obs = 0
    _total_los = 0
    pred = []
    label = []
    for i, batch in enumerate(valid_data):
        _dat = batch.data[0].as_in_context(context)
        _label = batch.data[1].as_in_context(context)
        _out = mlp(_dat)
        n_obs += _dat.shape[0]
        _total_los += nd.sum(loss(_out, _label)).asscalar() # convert to a Python float before accumulating
        pred.extend(nd.softmax(_out)[:,1].asnumpy())
        label.extend(_label.asnumpy())
    va_acc = accuracy_score(label, [round(p) for p in pred])
    va_loss = _total_los/n_obs
    tqdm.write('Epoch {}: tr_loss = {}, tr_acc= {}, va_loss = {}, va_acc= {}'.format(epoch, tr_loss, tr_acc, va_loss, va_acc))

Closing

Even a simple one-hot representation of words was enough to build a high-performing model. This dataset is admittedly an entry-level, easy one, but in text classification this level of technique is said to be enough to get good performance. In the next post I will look at sentence representation using a CNN.

Between GAN and WGAN - II

Fri, 08 Jun 2018 17:00:00 +0900

Introduction

Let me summarize what was discussed in the previous post. As a reminder, assuming an optimal discriminator, the GAN loss function is

$L(G,D^* ) = 2D_{JS}(p_{data}||p_G) - 2\log 2$

Conceptually, a GAN can be said to learn in the direction that minimizes the JS divergence between the generated data and the original data.

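However, the JS divergence is not always well behaved. It is informative only when the distributions share support (the absolute continuity assumption), and according to the manifold hypothesis there is a very high probability that they do not. We therefore need a more stably defined measure of the distance between distributions, and the W-distance has this property. The W-distance does not require absolute continuity between the two distributions being compared; it only requires absolute continuity of the distribution in question, which is a much weaker condition. Moreover, the W-distance can converge to 0 even when the KL/JS divergence does not, while whenever the KL/JS divergence converges to 0 the W-distance must converge as well. So driving W to 0 should be much easier than driving the KL/JS divergence to 0, which means the W-distance can be defined far more stably.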

So we arrived at the W-distance, but ran into another obstacle: by its original definition, just computing this metric is already a struggle. This post looks at how that problem is resolved.

Kantorovich-Rubinstein Duality

Assuming $P_\theta$ and $P_r$ are discrete distributions, $\gamma(x,y)$ can be written as the following matrix.

$\gamma = \left(\begin{array}{ccc} \gamma(x_1, y_1) &\cdots &\gamma(x_1, y_l) \\ \vdots &\cdots & \vdots \\\gamma(x_l, y_1) &\cdots &\gamma(x_l, y_l)\end{array}\right)$

And if we define a matrix $D$ holding the distances between elements of the support as follows,

$D = \left(\begin{array}{ccc} \|x_1 - y_1\| &\cdots &\| x_1 - y_l\| \\ \vdots &\cdots & \vdots \\\|x_l - y_1\| &\cdots &\|x_l - y_l\|\end{array}\right)$

the expectation can be obtained as the Frobenius inner product of the two matrices above. $\left<D, \gamma \right>_F = E_{(x, y) \sim \gamma(x, y)} \|x-y\|$

The $vec$ operator rewrites a matrix as a vector. The matrices above become vectors of length $l^2$.

$d = vec(D), g = vec(\gamma)$

The cost $c = d^Tg$

is the objective we want to minimize, subject to the following constraints.

Every element of $g$ is nonnegative
$\sum_{i=1}^l \gamma(x_i, y) = P_r(y)$
$\sum_{j=1}^l \gamma(x, y_j) = P_\theta(x)$

Defining a suitable design matrix $A$ to express these conditions,

the second and third conditions can be written as $A g = b$, where $b$ stacks $P_r$ and $P_\theta$.

In summary, computing the EM distance amounts to solving the following linear programming problem.

$\min_{g} c = \min_{g} d^T g$

s.t. $g \ge 0$ and $Ag = b$

NOTE: The following example simulates, in Python, finding the transportation plan that minimizes the moving cost.

import numpy as np
from scipy.optimize import linprog

p_r = (.1, .2, .1, .2, .1, .2, .1)
p_t = (.1, .1, .1, .1, .1, .1, .4)

l = 7
A_r = np.zeros((l,l,l))
A_t = np.zeros((l,l,l))

for i in range(l):
    for j in range(l):
        A_r[i,i,j] = 1
        A_t[i,j,i] = 1  

D = np.zeros((l, l))
for i in range(l):
    for j in range(l):
        D[i, j] = np.abs(i-j)

A = np.concatenate((A_r.reshape((l, l**2)), A_t.reshape((l, l**2))), axis = 0)
b = np.concatenate((p_r, p_t), axis = 0)
c = D.reshape((l**2))

opt_res = linprog(c, A_eq = A, b_eq = b)
emd = opt_res.fun
gamma = opt_res.x.reshape((l, l))

Above, $\gamma(x,y)$ is the optimal transportation plan, and the following is the resulting plan.

array([[0.1, 0. , 0. , 0. , 0. , 0. , 0. ],
       [0. , 0.1, 0.1, 0. , 0. , 0. , 0. ],
       [0. , 0. , 0. , 0. , 0. , 0. , 0.1],
       [0. , 0. , 0. , 0.1, 0. , 0.1, 0. ],
       [0. , 0. , 0. , 0. , 0.1, 0. , 0. ],
       [0. , 0. , 0. , 0. , 0. , 0. , 0.2],
       [0. , 0. , 0. , 0. , 0. , 0. , 0.1]])

Summing the array above along rows and along columns makes it easy to see that it is a joint probability distribution whose marginals are the two distributions.
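The marginal check and the resulting cost can be verified directly on the printed array. The sketch below also compares the cost with the sum of absolute differences between the two cumulative distributions, which is the closed form of the 1-Wasserstein distance for distributions on a line with unit spacing; bringing in that closed form is my addition, not part of the post.

```python
import numpy as np

p_r = np.array([.1, .2, .1, .2, .1, .2, .1])
p_t = np.array([.1, .1, .1, .1, .1, .1, .4])
gamma = np.array([[0.1, 0. , 0. , 0. , 0. , 0. , 0. ],
                  [0. , 0.1, 0.1, 0. , 0. , 0. , 0. ],
                  [0. , 0. , 0. , 0. , 0. , 0. , 0.1],
                  [0. , 0. , 0. , 0.1, 0. , 0.1, 0. ],
                  [0. , 0. , 0. , 0. , 0.1, 0. , 0. ],
                  [0. , 0. , 0. , 0. , 0. , 0. , 0.2],
                  [0. , 0. , 0. , 0. , 0. , 0. , 0.1]])

l = len(p_r)
D = np.abs(np.arange(l)[:, None] - np.arange(l)[None, :])   # |i - j| distance matrix

row_ok = np.allclose(gamma.sum(axis=1), p_r)   # row sums give p_r
col_ok = np.allclose(gamma.sum(axis=0), p_t)   # column sums give p_t
cost = (D * gamma).sum()                       # <D, gamma>_F
cdf_gap = np.abs(np.cumsum(p_r) - np.cumsum(p_t)).sum()
print(row_ok, col_ok, cost, cdf_gap)
```

Both the transport cost of the plan and the cumulative-distribution formula come out at 0.9, so the plan returned by the solver is indeed optimal for this example.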

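For large problems it is hard to find the transportation plan directly as above. For this we define the dual problem. Every optimization problem has a dual problem (depending on the problem, strong or weak duality holds). The dual of the problem above can be written as follows.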

$\max_{y} \tilde c = b^T y$

s.t. $A^Ty \le d$

The optimization problem derived originally is called the primal form, and the problem just defined is called the dual form.
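For the small example used earlier, both forms can be solved numerically and compared. This is a sketch under the assumption that `scipy.optimize.linprog` is available; since `linprog` minimizes, the dual objective is passed with a minus sign, and the dual variables are left unbounded.

```python
import numpy as np
from scipy.optimize import linprog

p_r = np.array([.1, .2, .1, .2, .1, .2, .1])
p_t = np.array([.1, .1, .1, .1, .1, .1, .4])
l = len(p_r)

# Same constraint matrix as the primal: rows of gamma sum to p_r, columns to p_t
A_r = np.zeros((l, l, l))
A_t = np.zeros((l, l, l))
for i in range(l):
    A_r[i, i, :] = 1
    A_t[i, :, i] = 1
A = np.concatenate((A_r.reshape(l, l**2), A_t.reshape(l, l**2)), axis=0)
b = np.concatenate((p_r, p_t))
d = np.abs(np.arange(l)[:, None] - np.arange(l)[None, :]).reshape(l**2).astype(float)

primal = linprog(d, A_eq=A, b_eq=b)                         # min d^T g  s.t. A g = b, g >= 0
dual = linprog(-b, A_ub=A.T, b_ub=d, bounds=(None, None))   # max b^T y  s.t. A^T y <= d, y free
primal_value, dual_value = primal.fun, -dual.fun
print(primal_value, dual_value)
```

The two optimal values agree, which is the numerical face of strong duality for this linear program. The dual solution itself is not unique, because adding a constant to one half of $y$ and subtracting it from the other leaves both the constraints and the objective unchanged.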

Kantorovich's formulation can be written in the canonical form of linear programming, for which strong duality holds. Strong duality means that the optimal values of the primal and dual problems coincide, so we can solve whichever of the two is easier. In this problem $P_r$ and $P_\theta$ appear in the objective of the dual form, which makes the dual form more intuitive.

The maximum of the objective in the dual form, $\tilde c = b^T y^* $, can be expressed as follows. Write $y^* = \left(\begin{array}{c} \boldsymbol f \\ \boldsymbol g\end{array}\right)$, where $\boldsymbol f$ and $\boldsymbol g$ each belong to $\mathbb R^l$. The objective can then be written as

$\tilde c = \sup \boldsymbol f ^T P_\theta + \boldsymbol g^T P_r$

From the constraint $A^Ty \le d$ we obtain the following relation.

$f(x_i) + g(x_i) \le D_{i,i}$

$D_{i,i}$ is 0 for every $i$, since the distance from a point to itself is 0.

Since $P_\theta$ and $P_r$ are both nonnegative (an axiom of probability), maximizing the EMD means maximizing $\boldsymbol f^T P_\theta + \boldsymbol g^T P_r$, so each $g(x_i)$ should be as large as the constraint allows. Because $f(x_i) \le -g(x_i)$, the maximum is attained where $f(x_i) = - g(x_i)$, that is, where $f(x_i) + g(x_i) = 0$. (We speak of maximizing the EMD here because we are working in the dual space.)

Using this condition, the constraints can be written as follows.

$f(x_i) - f(x_j) \le D_{i, j}$

In other words, it suffices to maximize the EMD over the family of functions satisfying the condition above. The EMD itself can also be written as follows.

$\tilde c = f^T P_\theta - f^T P_r = E_{P_\theta}(f) - E_{P_r}(f)$

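In summary, computing the EMD originally requires picking, from the set of all possible joint distributions, the one with the smallest cost and taking the expectation under it. This has been converted into finding, among functions satisfying the Lipschitz condition, the function that maximizes $E_{P_\theta} (f) - E_{P_r}(f)$ and taking the expectation of that function.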

The W-distance given by the Kantorovich-Rubinstein duality described so far is defined as follows.

$W(P_r, P_\theta) = \sup_{\|f\| \le 1} E_{X\sim P_r}(f(X)) - E_{X\sim P_\theta}(f(X))$

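With this definition we no longer need to take the $\inf$ over all transportation plans. The W-distance is the largest value of this difference of expectations, under the given distributions, over Lipschitz continuous functions; here $\|f\| \le 1$ means that $f$ is 1-Lipschitz. A $K$-Lipschitz continuous function is one satisfying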

$|f(x) - f(y) | \le K |x - y|$

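for all $x$ and $y$, that is, a function whose slope is bounded in absolute value by $K$ wherever it is defined. Note that such a function is always continuous but need not be differentiable.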

The following holds.

$\sup_{\|f\| \le K} E_{X\sim P_r}(f(X)) - E_{X\sim P_\theta}(f(X)) = K\cdot W(P_r, P_\theta)$

If the distance is defined over the family of $K$-Lipschitz functions, it equals $K$ times the W-distance.

Now let's see how this distance is applied in WGAN.

The WGAN training algorithm

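The set of functions satisfying the Lipschitz condition is still far too large. So rather than all Lipschitz functions, we consider only functions that can be expressed with a particular parameter $w$, and suppose these functions satisfy $\| f_w\| \le K$. If the parameter space is $\mathcal W$, we have the following relation.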

$\sup_{w \in \mathcal W} E_{P_r}(f_w(X)) - E_{P_\theta}(f_w(X)) \le \sup_{\|f\| \le K} E_{X\sim P_r}(f(X)) - E_{X\sim P_\theta}(f(X)) = (a)$

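If we are lucky enough that some $f_w$ attains the maximum (a) within the Lipschitz family, we recover the W-distance exactly (up to the factor $K$), but that seems nearly impossible. As a practical alternative that takes computing cost into account, however, it may be a reasonable approximation.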

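If $f_w$ measures the W-distance reasonably well, we can compute the gradient of the loss based on it. Taking the partial derivative with respect to $\theta$, the first term does not depend on $\theta$ and vanishes, leaving only the second term.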

$\frac{\partial}{\partial \theta}\sup_{w \in \mathcal W}E_{X\sim P_r}(f_w(X)) - E_{X\sim P_\theta}(f_w(X)) = - \frac{\partial}{\partial \theta} E_{P_\theta}(f_w(x)) = -E_{P_\theta}\left(\nabla f_w(x)\right)$

The parameter $\theta$ is trained based on the expression above.

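In the original GAN there is a discriminator that looks at the fake and real distributions and decides whether a generated sample is real or fake. WGAN does not make such a binary decision; instead it measures how close the generated data is to the real data with the W-distance and updates the generator in the direction that reduces that distance.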

	GAN	WGAN
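Discriminator	Trains the discriminator through the JS divergence	$\cdot$
Critic	$\cdot$	Approximates the Lipschitz function needed to compute the Wasserstein distance
Generator	Trained so that the discriminator output approaches 1	Trained to reduce the W-distance to the data distribution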

In summary, training proceeds in the following order.

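With $\theta$ fixed, compute $W(P_r, P_\theta)$. To measure the distance as well as possible we need the supremum of $E_{X\sim P_r}(f_w(X)) - E_{X \sim P_\theta}(f_w(X))$, so $w$, the parameter of $f_w$, is trained in the direction that increases the objective. This is the step that measures the distance accurately.
Once $f_w$ has more or less converged, that is, once we have a reasonably accurate W-distance, this function is used to estimate the gradient of the objective, $- E_{P_\theta}(\nabla f_w(x))$. Since the distance now has to be minimized, $\theta$ is trained in the direction that decreases the objective.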
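The two alternating steps can be sketched on a one-dimensional toy problem. Everything concrete here is an assumption chosen for illustration: the data distribution is $N(3, 1)$, the generator only shifts noise by $\theta$, the critic is the linear function $f_w(x) = w x$, and the Lipschitz constraint is enforced by clipping $w$ to $[-1, 1]$. Weight clipping and several critic updates per generator update follow the WGAN paper, but the learning rates and step counts are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
real_mean = 3.0          # P_r = N(3, 1), toy data distribution
theta = 0.0              # generator parameter: x_fake = z + theta, z ~ N(0, 1)
w = 0.0                  # critic parameter: f_w(x) = w * x
clip, n_critic = 1.0, 5  # clipping keeps |w| <= 1, so f_w is 1-Lipschitz
lr_w, lr_theta, batch = 0.1, 0.05, 256

for step in range(400):
    # Step 1: with theta fixed, move w to increase E_{P_r} f_w - E_{P_theta} f_w
    for _ in range(n_critic):
        x_real = rng.normal(real_mean, 1.0, batch)
        x_fake = rng.normal(0.0, 1.0, batch) + theta
        grad_w = x_real.mean() - x_fake.mean()      # derivative of the objective in w
        w = float(np.clip(w + lr_w * grad_w, -clip, clip))
    # Step 2: with w fixed, move theta to decrease the objective;
    # the derivative of -E f_w(z + theta) with respect to theta is -w
    theta = theta - lr_theta * (-w)

print(theta, w)
```

In this toy setting $\theta$ travels from 0 toward the data mean and then hovers around it, because the critic's slope changes sign whenever the generated mean overshoots the real one. A linear critic can only see differences in means, so this illustrates the update order rather than a usable model.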

The following is the algorithm presented in the paper.

Closing

In the next post we will look at how WGAN is coded through an actual implementation.

References

https://www.cph-ai-lab.com/wasserstein-gan-wgan
https://vincentherrmann.github.io/blog/wasserstein/
https://www.alexirpan.com/2017/02/22/wasserstein-gan.html