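Implementing a Convolutional Neural Network (TextCNN) for Sentence Classification

Overview

This post documents my implementation of the experiments in the paper Convolutional Neural Networks for Sentence Classification, which describes how to classify sentences with a CNN. Code that implements CNN sentence classification (TextCNN) already exists online, but it is based on Theano. This post implements the paper's full experimental pipeline in TensorFlow instead, partly to get familiar with the TensorFlow API and partly to deepen my own understanding of how CNNs are applied in NLP.
GitHub repository for this example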

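Experimental approach of the paper

1. Model diagram

Diagram first, explanation second.
Figure: TextCNN model architecture

The figure shows the TextCNN architecture. Each word in a sentence is represented by a K-dimensional vector, so a sentence of N words becomes an N*K matrix, which is the input to the CNN.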
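
As a rough illustration of the N*K input, the sketch below stacks per-word vectors into one matrix. The toy vocabulary, the 4-dimensional vectors, and the helper name `sentence_to_matrix` are made up for this example; the actual experiment uses K = 300.

```python
import numpy as np

# Toy embedding table: each word maps to a K-dimensional vector (K = 4 here).
K = 4
rng = np.random.default_rng(0)
toy_vectors = {w: rng.normal(size=K) for w in ["i", "like", "this", "movie"]}


def sentence_to_matrix(sentence, vectors):
    """Stack the vector of every word into an N x K matrix (hypothetical helper)."""
    words = sentence.lower().split()
    return np.stack([vectors[w] for w in words])


matrix = sentence_to_matrix("I like this movie", toy_vectors)
print(matrix.shape)  # N = 4 words, K = 4 dimensions
```

Row i of the matrix is the vector of the i-th word, so word order is preserved for the convolution filters.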

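2. Open questions before the experiment

2.1 Word embedding: which embedding method (one-hot, word2vec, or GloVe) works better?
2.2 How is N in the CNN's N*K input defined, i.e. how is the length of the input sequence fixed? Sentences contain different numbers of words, but the CNN needs an input matrix of fixed size N*K.
2.3 How are words that are not in the vocabulary embedded?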

3. TextCNN model description and experiment overview

3.1 Dataset

The paper runs experiments on several datasets. I used only the MR dataset, evaluated with 10-fold cross-validation.

MR: Movie reviews with one sentence per review. Classification involves detecting positive/negative reviews.
Specifically:
rt-polarity.pos contains 5331 positive snippets
rt-polarity.neg contains 5331 negative snippets

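3.2 Model variants used in the experiment

CNN-rand: all word vectors in the sentence are randomly initialized and treated as parameters to be optimized during CNN training.
CNN-static: the word vectors are taken from a table pre-trained with word2vec on the Google News dataset (about 100 billion words). They serve as a fixed input during CNN training and are not optimized.
CNN-non-static: the word vectors are taken from the same pre-trained word2vec table, but instead of staying fixed they are fine-tuned as parameters during CNN training.
Notes: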

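3.2.1 The word vector table GoogleNews-vectors-negative300.bin.gz was pre-trained with the word2vec command-line tool; training takes a long time.
Already trained: GoogleNews-vectors-negative300.bin.gz, Baidu Cloud download link, password: 18yf
3.2.2 An example word2vec training command: ./word2vec -train text8 -output vectors.bin -cbow 0 -size 48 -window 5 -negative 0 -hs 1 -sample 1e-4 -threads 20 -binary 1 -iter 100. Here text8 is the corpus, vectors.bin is the output word vector table, and -cbow selects the training model (0 for skip-gram, 1 for CBOW).
3.2.3 Besides word2vec, GloVe or FastText can also be used to pre-train word vectors on a corpus.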
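
The three variants differ only in how the embedding matrix starts out and whether it is updated during training. The sketch below captures that as a small lookup. The helper name `build_embedding` and the random range of plus or minus 0.25 are my own assumptions for illustration, not something taken from the paper or the repository.

```python
import numpy as np


def build_embedding(variant, vocab_size, dim, pretrained=None, seed=0):
    """Return (initial_matrix, trainable) for one model variant (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    if variant == "rand":
        # Random start, updated during training. The range is an assumption.
        return rng.uniform(-0.25, 0.25, size=(vocab_size, dim)), True
    if pretrained is None:
        raise ValueError("static and non-static need pretrained vectors")
    if variant == "static":
        # Pre-trained vectors, frozen during training.
        return pretrained.copy(), False
    if variant == "non-static":
        # Pre-trained vectors, fine-tuned during training.
        return pretrained.copy(), True
    raise ValueError("unknown variant: " + variant)


pretrained = np.ones((5, 3))  # stand-in for a real word2vec table
for name in ("rand", "static", "non-static"):
    matrix, trainable = build_embedding(name, 5, 3, pretrained)
    print(name, matrix.shape, trainable)
```

In TensorFlow the `trainable` value would map onto the `trainable` argument of `tf.Variable`.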

3.3 Model architecture

Model parameters

Activation: rectified linear units (ReLU)

Filter window sizes h: 3, 4, 5, with 100 feature maps for each size

Dropout rate (p) of 0.5, l2 constraint (s) of 3

Mini-batch size of 50

Gradient descent learning rate of 0.05
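
The l2 constraint (s) in the paper is a max-norm rule: after a gradient step, a weight vector whose l2 norm exceeds s is rescaled so that its norm equals s. Here is a minimal NumPy sketch of that rule; the function name `clip_l2_norm` is hypothetical. Note that the TensorFlow code later in this post adds an L2 penalty term to the loss instead, which is a different form of regularization.

```python
import numpy as np


def clip_l2_norm(w, s=3.0):
    """Rescale w so that its l2 norm is at most s (hypothetical helper)."""
    norm = np.linalg.norm(w)
    if norm > s:
        return w * (s / norm)
    return w


big = np.array([3.0, 4.0])    # norm 5 > 3, gets rescaled
small = np.array([1.0, 2.0])  # norm below 3, left unchanged
print(np.linalg.norm(clip_l2_norm(big)))
print(clip_l2_norm(small))
```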

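3.3.1 Input layer
As shown in the figure above, the input layer takes an N*K matrix built from the word vectors of the words in each sentence, where K is the word vector length and N is the sentence length. The word vectors can be set up in three ways: CNN-rand, CNN-static, and CNN-non-static. For words missing from the pre-trained word vector table (out-of-vocabulary words), the paper's experiments use randomly initialized vectors with small values close to 0. This answers question 2.3 (it can be seen as a form of smoothing).
3.3.2 Convolution layer
On top of the input layer, filter windows are convolved over the input to produce feature maps. The experiment uses three filter window sizes, 3*K, 4*K, and 5*K, where K is the word vector length. Each size has 100 filters with different weights. Each filter extracts one feature map from the input matrix; in NLP these are text features.
Feature maps are pooled with max-over-time pooling: the largest value of each feature map vector is taken, and these maxima together form a one-dimensional vector.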
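
To make the shapes concrete, here is a small NumPy sketch of one filter sliding over a sentence matrix, followed by ReLU and max-over-time pooling. It is a plain-Python illustration, not the TensorFlow code used in the experiment; the function name `conv_max_over_time` and the random inputs are assumptions.

```python
import numpy as np


def conv_max_over_time(sentence, filt, bias=0.0):
    """Slide one h x K filter over an N x K sentence, apply ReLU, take the max."""
    n, _ = sentence.shape
    h = filt.shape[0]
    # One value per window position: N - h + 1 values in total.
    feature_map = np.array([
        np.sum(sentence[i:i + h] * filt) + bias
        for i in range(n - h + 1)
    ])
    feature_map = np.maximum(feature_map, 0.0)  # ReLU
    return feature_map, feature_map.max()


rng = np.random.default_rng(0)
N, K = 56, 300
sentence = rng.normal(size=(N, K))

pooled = []
for h in (3, 4, 5):
    fmap, best = conv_max_over_time(sentence, rng.normal(size=(h, K)))
    print(h, fmap.shape)  # the feature map has N - h + 1 entries
    pooled.append(best)
```

With 100 filters per window size instead of one, the pooled values would form a vector of length 3 * 100 = 300, which is what the fully connected layer receives.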
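3.3.3 Fully connected layer
The input to this layer is the one-dimensional vector produced by pooling. It passes through an activation function, followed by a dropout layer to prevent overfitting. L2 regularization is also applied to the parameters of the fully connected layer.
3.3.4 Output layer
The input to this layer is the output of the fully connected layer, which is passed through a softmax layer to perform classification. A softmax layer suits multi-class problems, while for binary classification a single neuron with a sigmoid activation can serve as the output layer. The experiment uses a softmax layer.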
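
For two classes the two output choices are equivalent: the softmax probability of class 1 equals a sigmoid applied to the difference of the two logits. A quick numeric check (the logit values are arbitrary):

```python
import numpy as np


def softmax(z):
    e = np.exp(z - np.max(z))  # subtract the max for numerical stability
    return e / e.sum()


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


logits = np.array([0.3, 1.7])               # scores for class 0 and class 1
p_softmax = softmax(logits)[1]              # P(class 1) from a 2-way softmax
p_sigmoid = sigmoid(logits[1] - logits[0])  # P(class 1) from a single sigmoid unit
print(np.isclose(p_softmax, p_sigmoid))     # True
```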

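Code walkthrough

Gripes first, takeaways second

I have to gripe a little about the implementation: writing the code took 2 days, and debugging took another 2 days. Maybe that is because I am still new to TensorFlow (to console myself). Behind the griping, though, there are lessons worth reflecting on.
1. When building a multi-layer neural network, settle the architecture first: which layers the network has, what the input and output of each layer are, which activation function the neurons use, and what the weights and biases of each layer are. Plan this up front, or debugging will be painful!
2. Be careful with text preprocessing. Think through every type conversion in advance, and when constructing the training and test sets, write a small demo of the training data first.

Write the code as an explicit sequence of steps (first, then, finally); otherwise hunting for bugs during debugging is agonizing.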

Step 1 Building the overall pipeline

text_cnn_main.py
1 get parameters -> 2 load data -> 3 create TextCNN model -> 4 start training -> 5 validation

import argparse
import pickle

import numpy as np
from tqdm import tqdm

import process_data                 # the data helpers described in Step 3
from text_cnn_model import TextCNN  # assumed import path for the model in Step 4

# 1 get parameters
parse = argparse.ArgumentParser(description='Paramaters for construct TextCNN Model')
# Option 1: parse the value as a bool
# parse.add_argument('--nonstatic',type=ast.literal_eval,help='use textcnn nonstatic or not',dest='tt')
# Option 2 (used here): mutually exclusive flags that set a bool
group_static = parse.add_mutually_exclusive_group(required=True)
group_static.add_argument('--static', dest='static_flag', action='store_true', help='use static Text_CNN')
group_static.add_argument('--nonstatic', dest='static_flag', action='store_false', help='use nonstatic Text_CNN')

group_word_vec = parse.add_mutually_exclusive_group(required=True)
group_word_vec.add_argument('--word2vec', dest='wordvec_flag', action='store_true', help='word_vec is word2vec')
group_word_vec.add_argument('--rand', dest='wordvec_flag', action='store_false', help='word_vec is rand')

group_shuffer_batch = parse.add_mutually_exclusive_group(required=False)
group_shuffer_batch.add_argument('--shuffer', dest='shuffer_flag', action='store_true', help='the train do shuffer')
group_shuffer_batch.add_argument('--no-shuffer', dest='shuffer_flag', action='store_false',
                                 help='the train do not shuffer')

parse.add_argument('--learnrate', type=float, dest='learnrate', help='the NN learnRate', default=0.05)
parse.add_argument('--epochs', type=int, dest='epochs', help='the model train epochs', default=10)
parse.add_argument('--batch_size', type=int, dest='batch_size', help='the train gd batch size.(50-300)', default=50)
parse.add_argument('--dropout_pro', type=float, dest='dropout_pro', help='the nn layer dropout_pro', default=0.5)

parse.set_defaults(static_flag=True)
parse.set_defaults(wordvec_flag=True)
parse.set_defaults(shuffer_flag=False)

args = parse.parse_args()

# 2 load data
print('load data. . .')
with open('./NLP/result/word_vec.p', 'rb') as f:
    X = pickle.load(f)

word_vecs_rand, word_vecs, word_cab, sentence_max_len, revs = X[0],X[1],X[2],X[3],X[4]

print('load data finish. . .')
# TextCNN configuration
filter_sizes = [3, 4, 5]
filter_numbers = 100
embedding_size = 300
# use word2vec or not
W = word_vecs_rand
if args.wordvec_flag:
    W = word_vecs
# pdb.set_trace()
word_ids,W_list = process_data.getWordsVect(W)

# use static training or not
static_falg = args.static_flag
# shuffle the data or not
shuffer_falg = args.shuffer_flag
# 10-fold cross-validation
results = []
for index in tqdm(range(10)):
    # debugging breakpoint
    # pdb.set_trace()
    # train_x, train_y, test_x, test_y = process_data.get_train_test_data1(W,revs,index,sentence_max_len,default_values=0.0,vec_size=300)
    train_x, train_y, test_x, test_y = process_data.get_train_test_data2(word_ids,revs,index,sentence_max_len)
    # 3 create TextCNN model
    text_cnn = TextCNN(W_list,shuffer_falg,static_falg,filter_numbers,filter_sizes,sentence_max_len,embedding_size,args.learnrate,args.epochs,args.batch_size,args.dropout_pro)
    # 4 start train
    text_cnn.train(train_x,train_y)
    # 5 validation
    accur,loss = text_cnn.validataion(test_x, test_y)
    results.append(accur)
    print('cv {} accur is :{:.3f} loss is {:.3f}'.format(index+1,accur,loss))
    text_cnn.close()
# mean accuracy over the 10 folds
print('last accuracy is {}'.format(np.mean(results)))

Step 2 Parameter description

Command-line arguments are parsed with argparse.
Example: python ./NLP/Text_CNN/text_cnn_main.py --nonstatic --word2vec

Paramaters for construct TextCNN Model
optional arguments:
  -h, --help            show this help message and exit
  --static              use static Text_CNN
  --nonstatic           use nonstatic Text_CNN
  --word2vec            word_vec is word2vec
  --rand                word_vec is rand
  --shuffer             the train do shuffer
  --no-shuffer          the train do not shuffer
  --learnrate LEARNRATE
                        the NN learnRate
  --epochs EPOCHS       the model train epochs
  --batch_size BATCH_SIZE
                        the train gd batch size.(50-300)
  --dropout_pro DROPOUT_PRO
                        the nn layer dropout_pro
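
To see how the paired flags behave, here is a stripped-down parser with a single mutually exclusive group, fed argument lists directly instead of reading the command line. The parser below is a toy stand-in for illustration, not the script's full parser.

```python
import argparse

parser = argparse.ArgumentParser(description="toy TextCNN flags")
group = parser.add_mutually_exclusive_group(required=True)
# Both flags write to the same destination, one as True and one as False.
group.add_argument("--static", dest="static_flag", action="store_true")
group.add_argument("--nonstatic", dest="static_flag", action="store_false")
parser.add_argument("--epochs", type=int, default=10)

args = parser.parse_args(["--nonstatic", "--epochs", "3"])
print(args.static_flag, args.epochs)  # False 3

# Passing both flags of the group is rejected: argparse reports the error and exits.
try:
    parser.parse_args(["--static", "--nonstatic"])
    rejected = False
except SystemExit:
    rejected = True
print(rejected)  # True
```

Because the group is created with `required=True`, omitting both flags is rejected in the same way.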

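Step 3 Data processing

process_data.py: the full code is not shown here; see the GitHub repository.

1. Load the dataset from the binary files, and assign each review its label and its cross-validation fold.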

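def load_data_k_cv(folder,cv=10,clear_flag=True)
Parameters:
folder: path to the MR binary files
cv: number of folds for K-fold cross-validation
clear_flag: whether to replace special characters
Returns:
word_cab = defaultdict(float): the vocabulary of the training set with the frequency count of each word.
revs = []: one record per review.
For example, revs[0] = {"y": 1,
    "text": 'I like this movie',
    "num_words": 4,
    "spilt": np.random.randint(0, cv)
}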
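
The fold id stored in each record is what later separates training data from test data. Below is a minimal sketch of that idea using the same record layout, including the "spilt" key as spelled in the post. The helper names `make_revs` and `split_by_fold` and the toy reviews are hypothetical, and the standard library `random` module stands in for `np.random.randint`.

```python
import random


def make_revs(texts_with_labels, cv=10, seed=0):
    """Build review records, assigning each one a random fold id in [0, cv)."""
    rng = random.Random(seed)
    revs = []
    for text, label in texts_with_labels:
        revs.append({
            "y": label,
            "text": text,
            "num_words": len(text.split()),
            "spilt": rng.randrange(cv),
        })
    return revs


def split_by_fold(revs, cv_id):
    """Records in fold cv_id form the test set; all other folds form the training set."""
    train = [r for r in revs if r["spilt"] != cv_id]
    test = [r for r in revs if r["spilt"] == cv_id]
    return train, test


toy = [("I like this movie", 1), ("boring and slow", 0)] * 10
revs = make_revs(toy)
train, test = split_by_fold(revs, cv_id=0)
print(len(train), len(test))
```

Random assignment makes the folds only roughly equal in size, so the test set size varies a little from fold to fold.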

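2. Load the binary file of pre-trained Word2Vec word vectors, which were trained on the Google News corpus.

# The loading logic follows the word2vec.WordVectors.from_binary(fname, *args, **kwargs) method
def load_binary_vec(fname, vocab)
Parameters:
fname: name of the file containing the word vectors pre-trained with word2vec
vocab: the vocabulary of the MR training set
Returns:
word_vecs = {}: for each MR training set word found in the pre-trained word2vec table, its vector.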
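
For reference, a word2vec binary file starts with a text header line holding the vocabulary size and the vector dimension, followed by one entry per word: the word, a space, and then the vector as raw 32-bit floats. The sketch below reads that layout with the standard library only and is tested on a tiny file it writes itself. It is not the repository's `load_binary_vec`; the name `load_binary_vec_sketch` and the little-endian assumption are mine.

```python
import os
import struct
import tempfile


def load_binary_vec_sketch(fname, vocab):
    """Read the vectors of the words in vocab from a word2vec-format binary file."""
    word_vecs = {}
    with open(fname, "rb") as f:
        vocab_size, dim = map(int, f.readline().split())
        for _ in range(vocab_size):
            chars = []
            while True:
                ch = f.read(1)
                if ch == b" " or ch == b"":
                    break
                if ch != b"\n":  # skip the newline left over from the previous entry
                    chars.append(ch)
            word = b"".join(chars).decode("utf-8", errors="ignore")
            raw = f.read(4 * dim)
            if len(raw) < 4 * dim:
                break  # truncated file
            if word in vocab:
                # Assumes little-endian 32-bit floats.
                word_vecs[word] = struct.unpack("<%df" % dim, raw)
    return word_vecs


# Write a two-word toy file in the same layout, then read it back.
path = os.path.join(tempfile.mkdtemp(), "toy.bin")
with open(path, "wb") as f:
    f.write(b"2 3\n")
    for word, vec in (("good", (1.0, 2.0, 3.0)), ("bad", (4.0, 5.0, 6.0))):
        f.write(word.encode("utf-8") + b" " + struct.pack("<3f", *vec) + b"\n")

print(load_binary_vec_sketch(path, {"good"}))
```

Only words present in `vocab` are kept, which matters because the full Google News table is far larger than the MR vocabulary.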

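3. Handle words in the MR training set that do not appear in the Google News corpus (out-of-vocabulary words).

def add_unexist_word_vec(w2v,vocab)
# Initialize vectors for the vocabulary words that have no embedding
:param w2v: the word vectors pre-trained with word2vec
:param vocab: the full vocabulary that needs embeddings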
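
One way such a function could look is sketched below: every vocabulary word without a pre-trained vector gets a small random one. The function name `add_unknown_words_sketch` is hypothetical, and the uniform range of plus or minus 0.25 is an assumption on my part rather than a value taken from the repository.

```python
import numpy as np


def add_unknown_words_sketch(word_vecs, vocab, dim=300, scale=0.25, seed=0):
    """Give every vocab word that lacks a vector a small random one."""
    rng = np.random.default_rng(seed)
    for word in vocab:
        if word not in word_vecs:
            # Assumed initialization: uniform values in [-scale, scale).
            word_vecs[word] = rng.uniform(-scale, scale, size=dim)
    return word_vecs


known = {"movie": np.ones(300)}
filled = add_unknown_words_sketch(known, ["movie", "unwatchable"])
print(sorted(filled))
print(filled["unwatchable"].shape)
```

Words that already have a pre-trained vector are left untouched.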

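4. Build the dataset for training, i.e. the model's input and output formats.
Approach 1: feed the matrix of word vectors for each sentence directly, with shape [sentence_length, embedding_size]. The experiment uses the word count of the longest review as the CNN's fixed sentence_length and pads shorter sentences with 0. This answers question 2.2.

input shape:[min_batch_size,sentence_length,embedding_size]
output shape:[min_batch_size,label_size]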

Approach 2: feed the id of each word in the sentence, i.e. its index in the word2vec word vector table, for use with tf.nn.embedding_lookup later on.

input shape:[min_batch_size,sentence_length]
output shape:[min_batch_size,label_size]
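
A small sketch of what approach 2 produces: each sentence becomes a fixed-length row of word ids, padded at the end, and each label becomes a one-hot row. The helper names, the toy vocabulary, and the choice of id 0 as the padding id are assumptions for illustration.

```python
import numpy as np


def sentences_to_ids(sentences, word_to_id, sent_length, pad_id=0):
    """Map each sentence to a fixed-length row of word ids, padded with pad_id."""
    x = np.full((len(sentences), sent_length), pad_id, dtype=np.int32)
    for row, sentence in enumerate(sentences):
        ids = [word_to_id[w] for w in sentence.split()][:sent_length]
        x[row, :len(ids)] = ids
    return x


def one_hot(labels, num_classes=2):
    """Turn integer labels into one-hot rows."""
    return np.eye(num_classes, dtype=np.int32)[labels]


# Id 0 is reserved for padding in this toy vocabulary.
word_to_id = {"i": 1, "like": 2, "this": 3, "movie": 4, "boring": 5}
x = sentences_to_ids(["i like this movie", "boring movie"], word_to_id, sent_length=6)
y = one_hot([1, 0])
print(x)
print(y)
```

With this layout, row 0 of the embedding table would normally hold an all-zero vector so that padding contributes nothing to the convolution.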

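Comparison of the two approaches:
Approach 1: the dataset input is clear and explicit, and it is fed through a TensorFlow placeholder. It is hard to adapt to CNN-nonstatic and CNN-rand, because the word vectors arrive as fed data rather than as a trainable Variable, but it fits CNN-static very well.
Approach 2: the dataset is harder to construct, but the code for all three model types becomes very easy to write.

def get_train_test_data1(word_vecs,revs,cv_id=0,sent_length = 56,default_values=0.,vec_size = 300)
def get_train_test_data2(word_ids,revs,cv_id=0,sent_length = 56)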

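Step 4 Building the CNN-rand/CNN-static/CNN-nonstatic models

text_cnn_model.py, implemented in TensorFlow (this corresponds to approach 2 above).
A placeholder and a Variable play different roles: a placeholder receives the model's input samples through feed_dict, while a Variable holds model parameters. A Variable created with trainable=False is excluded from training, and with trainable=True it is optimized during training. This is the switch between CNN-static and CNN-nonstatic.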

# Fragment from the TextCNN class; assumes `import tensorflow as tf` (TensorFlow 1.x API)
# setting graph
tf.reset_default_graph()
self.train_graph = tf.Graph()
with self.train_graph.as_default():
    # 1 input layer
    self.input_x = tf.placeholder(dtype=tf.int32,shape=[None,sentence_length],name='input_x')
    self.input_y = tf.placeholder(dtype=tf.int32, shape=[None, 2], name='input_y')
    self.dropout_pro = tf.placeholder(dtype=tf.float32, name='dropout_pro')
    self.learning_rate = tf.placeholder(dtype=tf.float32, name='learning_rate')
    self.l2_loss = tf.constant(0.0)
    # Approach 1 alternative (unused): feed the embedded sentences through a placeholder
    # self.embedding_layer = tf.placeholder(dtype=tf.float32, shape=[self.batch_size, sentence_length, embedding_size],
    #                                       name='embedding_layer')
    # 2 embedding layer
    with tf.name_scope('embedding_layer'):
        # static model: embeddings stay fixed; non-static model: embeddings are trained
        train_bool = not self.__static_falg
        # tf.convert_to_tensor(W_list,dtype=tf.float32)
        # pdb.set_trace()
        self.embedding_layer_W = tf.Variable(initial_value=W_list,dtype=tf.float32, trainable=train_bool, name='embedding_layer_W')
        self.embedding_layer_layer = tf.nn.embedding_lookup(self.embedding_layer_W, self.input_x)
        # add a channel dimension: [batch, sentence_length, embedding_size, 1]
        self.embedding_layer_expand = tf.expand_dims(self.embedding_layer_layer, -1)

    # 3 conv layer + maxpool layer for each filter size
    pool_layer_lst = []
    for filter_size in filter_sizes:
        max_pool_layer = self.__add_conv_layer(filter_size,filter_numbers)
        pool_layer_lst.append(max_pool_layer)

    # 4 fully connected: dropout + softmax + l2
    # combine all the max-pooled features
    with tf.name_scope('dropout_layer'):
        # pdb.set_trace()
        max_num = len(filter_sizes) * self.filter_numbers
        h_pool = tf.concat(pool_layer_lst,name='last_pool_layer',axis=3)
        pool_layer_flat = tf.reshape(h_pool,[-1,max_num],name='pool_layer_flat')
        # TF 1.x: the second argument of tf.nn.dropout is keep_prob, the probability of keeping a unit
        dropout_pro_layer = tf.nn.dropout(pool_layer_flat,self.dropout_pro,name='dropout')

    with tf.name_scope('soft_max_layer'):
        SoftMax_W = tf.Variable(tf.truncated_normal([max_num,2],stddev=0.01),name='softmax_linear_weight')
        self.__variable_summeries(SoftMax_W)
        SoftMax_b = tf.Variable(tf.constant(0.1,shape=[2]),name='softmax_linear_bias')
        self.__variable_summeries(SoftMax_b)
        self.l2_loss += tf.nn.l2_loss(SoftMax_W)
        self.l2_loss += tf.nn.l2_loss(SoftMax_b)
        # dropout_pro_layer_reshape = tf.reshape(dropout_pro_layer,[batch_size,-1])
        # unnormalized class scores (logits); softmax is applied inside the loss
        self.softmax_values = tf.nn.xw_plus_b(dropout_pro_layer,SoftMax_W,SoftMax_b,name='soft_values')
        self.predictions = tf.argmax(self.softmax_values,axis=1,name='predictions',output_type=tf.int32)

    with tf.name_scope('loss'):
        losses = tf.nn.softmax_cross_entropy_with_logits(logits=self.softmax_values,labels=self.input_y)
        self.loss = tf.reduce_mean(losses) + 0.001 * self.l2_loss # lambda = 0.001
        tf.summary.scalar('last_loss',self.loss)

    with tf.name_scope('accuracy'):
        correct_acc = tf.equal(self.predictions,tf.argmax(self.input_y,axis=1,output_type=tf.int32))

        self.accuracy = tf.reduce_mean(tf.cast(correct_acc,'float'),name='accuracy')
        tf.summary.scalar('accuracy',self.accuracy)

    with tf.name_scope('train'):
        optimizer = tf.train.AdamOptimizer(learning_rate=self.learning_rate)
        # set a pdb breakpoint here when debugging
        # pdb.set_trace()
        self.train_op = optimizer.minimize(self.loss)

    # create the session and the summary writer
    self.session = tf.InteractiveSession(graph=self.train_graph)
    self.merged = tf.summary.merge_all()
    self.train_writer = tf.summary.FileWriter('./NLP/log/text_cnn', graph=self.train_graph)

Step 5 Model training and prediction

The main job here is feeding data to the model in batches.

def train(self,train_x,train_y):
    self.session.run(tf.global_variables_initializer())
    # training loop
    for epoch in range(self.epochs):
        # pdb.set_trace()
        train_batch = self.__get_batchs(train_x, train_y, self.batch_size)
        train_loss, train_acc, count = 0.0, 0.0, 0
        for batch_i in range(len(train_x)//self.batch_size):
            x,y = next(train_batch)
            feed = {
                self.input_x:x,
                self.input_y:y,
                self.dropout_pro:self.dropout_pro_item,
                self.learning_rate:self.learning_rate_item
            }
            _,summarys,loss,accuracy = self.session.run([self.train_op,self.merged,self.loss,self.accuracy],feed_dict=feed)
            # running totals, used to report the average loss and accuracy so far
            train_loss, train_acc, count = train_loss + loss, train_acc + accuracy, count + 1
            self.train_writer.add_summary(summarys,epoch)
            # print a log line every 15 batches
            if (batch_i+1) % 15 == 0:
                print('Epoch {:>3} Batch {:>4}/{} train_loss = {:.3f} accuracy = {:.3f}'.
                      format(epoch,batch_i,(len(train_x)//self.batch_size),train_loss/float(count),train_acc/float(count)))
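
The batch helper `__get_batchs` is not shown in this post. A generator along the following lines would fit the loop above; this is my own sketch under that assumption, not the repository's implementation, and the name `get_batches_sketch` is hypothetical.

```python
import numpy as np


def get_batches_sketch(x, y, batch_size, shuffle=False, seed=0):
    """Yield (x_batch, y_batch) pairs of exactly batch_size rows each."""
    x, y = np.asarray(x), np.asarray(y)
    order = np.arange(len(x))
    if shuffle:
        np.random.default_rng(seed).shuffle(order)
    n_batches = len(x) // batch_size  # an incomplete final batch is dropped
    for i in range(n_batches):
        idx = order[i * batch_size:(i + 1) * batch_size]
        yield x[idx], y[idx]


x = np.arange(10).reshape(10, 1)
y = np.arange(10)
batches = list(get_batches_sketch(x, y, batch_size=4))
print(len(batches))  # 2
```

Like the training loop, which runs len(train_x)//batch_size iterations, this sketch ignores the samples left over after the last full batch.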

References

1. Convolutional Neural Networks for Sentence Classification
2. A Sensitivity Analysis of (and Practitioners’ Guide to) Convolutional Neural Networks for Sentence Classification
3. A Neural Probabilistic Language Model
4. 卷积神经网络(CNN)在句子建模上的应用 (Applications of Convolutional Neural Networks (CNN) in Sentence Modeling)