The cries of wild geese fade away toward the Xiao and Xiang;
in the twelve-storied tower the moon shines on alone.
Preface
The final project for my computer vision course was to reproduce a vision-related paper. I have long been interested in image style transfer; after all, only two things in this world can shake us to our core: the vast, starry heavens above us and the sublime moral law within us. Philosophy and art are our eternal pursuits, and they are among the few pleasures left in my tedious daily routine. Back at Soochow University I entertained the fantasy of learning an art amid the trivial days to cultivate my temperament, and even briefly picked up a paintbrush, but in the end it all drifted away like spring water down the river. Ironically, I can now imitate the masters' works by way of deep learning instead, so I chose to reproduce the classic paper: Gatys L A, Ecker A S, Bethge M. A neural algorithm of artistic style[J]. arXiv preprint arXiv:1508.06576, 2015.
Theory
VGG19
VGG is a convolutional neural network architecture from the Visual Geometry Group at Oxford that performs very well on image classification, with VGG16 reaching about 92.7% top-5 accuracy on ImageNet. As the figure shows, the network stacks five convolution-plus-pooling blocks followed by a softmax classifier. This post uses the deeper 19-layer variant, VGG19 (the model whose weights are actually loaded in the code below), as the feature extractor for both the content image and the style image.
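For reference, this is how the VGG19 convolutional stack is indexed inside the pretrained MatConvNet weight file used later (the skipped indices hold the interleaved ReLU and pooling layers); the load_vgg_model function in the Implementation section uses exactly these indices:

```python
# Index -> layer name in imagenet-vgg-verydeep-19.mat,
# as consumed by load_vgg_model below.
VGG19_CONV_LAYERS = {
    0: 'conv1_1', 2: 'conv1_2',                                   # block 1
    5: 'conv2_1', 7: 'conv2_2',                                   # block 2
    10: 'conv3_1', 12: 'conv3_2', 14: 'conv3_3', 16: 'conv3_4',   # block 3
    19: 'conv4_1', 21: 'conv4_2', 23: 'conv4_3', 25: 'conv4_4',   # block 4
    28: 'conv5_1', 30: 'conv5_2', 32: 'conv5_3', 34: 'conv5_4',   # block 5
}
```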
Style Transfer
As illustrated above, one image provides the content (the content image) and another provides the style (the style image); style transfer fuses the style of the latter with the former, producing the content image rendered in that artistic style.
As the figure shows, feed the content image and the style image into the VGG network and inspect each layer's output: the lower layers respond to the image's basic pixel-level information, while the higher layers capture its overall content. We therefore treat artistic style as low-level pixel statistics, i.e. color, brushwork, and other qualities that resist explicit quantification, and content as high-level features such as edges, shapes, and objects. Accordingly, the outputs of the first convolutional layer in each of several blocks serve as the style features, and the output of a high convolutional layer serves as the content feature; these are used for style reconstruction and content reconstruction respectively. (Inspecting the reconstructions confirms this: at the high layers the pixel information has been discarded and only the overall content survives, while at the low layers the pixel information is largely intact.) Of course, a reconstruction can rarely lose little content and match the style image's style at the same time, so we compute the reconstruction's losses against the original style image and content image, combine them into a total loss, and minimize that loss by iteratively adjusting the input (which starts from a noise texture) until the output meets our requirements.
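The unusual part of this setup is that gradient descent runs on the input image rather than on the network weights. A minimal NumPy sketch of that idea, with a random frozen linear map standing in for VGG and made-up dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 16))   # frozen "network": weights never change
target = rng.standard_normal(8)    # "content features" we want to reproduce
x = rng.standard_normal(16)        # the input we optimize, starting from noise

lr = 0.01
for _ in range(1000):
    grad = 2 * W.T @ (W @ x - target)  # gradient of ||W x - target||^2 w.r.t. x
    x -= lr * grad

print(np.sum((W @ x - target) ** 2))   # close to zero: the input now matches the target
```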
Style Loss
Taking the outputs of VGG19's conv1_1, conv2_1, conv3_1, conv4_1, and conv5_1 layers as the style features, the style loss is computed as follows:
\[L_{style} ( \vec{a}, \vec{x} ) = \sum_{l=0}^{L} w_l E_l\]where \(\vec{a}\) and \(\vec{x}\) are the original (style) image and the reconstructed image, \(l\) indexes a particular layer of the network, \(w_l\) is that layer's weight, and \(E_l\) is defined as:
\[E_l = {1 \over{4N_l^2M_l^2}} \sum_{i,j} (G_{ij}^l - A_{ij}^l)^2\]where \(G^l\) and \(A^l\) are the Gram matrices of the reconstruction's and the original image's feature maps at layer \(l\), and \(N_l\) and \(M_l\) are the number of feature maps and their size at that layer:
\[G_{ij}^l = \sum_k F_{ik}^l F_{jk}^l\]PS: that sounds fancy, but it is just the inner product of the layer's flattened output matrix with its own transpose.
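To make the Gram matrix concrete, here is a minimal NumPy sketch with made-up dimensions, computing \(G\), \(A\), and the per-layer term \(E_l\) exactly as defined above:

```python
import numpy as np

rng = np.random.default_rng(0)
N_l, M_l = 3, 4                  # 3 feature maps, each with 4 spatial positions
F = rng.random((N_l, M_l))       # style image activations at layer l
F_hat = rng.random((N_l, M_l))   # reconstruction activations at layer l

def gram(X):
    # G_ij = sum_k X_ik X_jk: inner products between pairs of feature maps
    return X @ X.T

A, G = gram(F), gram(F_hat)
E_l = ((G - A) ** 2).sum() / (4 * N_l ** 2 * M_l ** 2)
print(E_l)
```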
With this in hand, every backward pass adjusts the input image itself (the network weights stay fixed) until the reconstructed style matches what we want.
Content Loss
Taking the output of VGG19's conv5_2 layer as the content feature (the original paper uses conv4_2; this implementation opts for the higher conv5_2), the content loss is:
\[L_{content} ( \vec{p}, \vec{x}, l ) = { 1 \over 2} \sum_{i,j} (F_{ij}^l - P_{ij}^l)^2\]where \(F^l\) and \(P^l\) are the layer-\(l\) feature maps of the reconstruction \(\vec{x}\) and the content image \(\vec{p}\).
Total Loss
\[L_{total}(\vec{p}, \vec{a}, \vec{x}) = \alpha L_{content}(\vec{p}, \vec{x}) + \beta L_{style}(\vec{a}, \vec{x})\]The weights \(\alpha\) and \(\beta\) can be tuned to preserve different proportions of content and style.
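Putting the pieces together, a small NumPy sketch of the content loss and the weighted total, with toy feature arrays and the same \(\alpha\), \(\beta\) values used in the code below:

```python
import numpy as np

rng = np.random.default_rng(0)
F = rng.random((16, 25))   # reconstruction features at the content layer
P = rng.random((16, 25))   # content image features at the same layer
S = rng.random((16, 25))   # style image features, reused for a one-layer style term

L_content = 0.5 * ((F - P) ** 2).sum()

N, M = F.shape
gram = lambda X: X @ X.T
L_style = ((gram(F) - gram(S)) ** 2).sum() / (4 * N ** 2 * M ** 2)  # single E_l, w_l = 1

alpha, beta = 5, 100       # larger alpha keeps more content, larger beta more style
L_total = alpha * L_content + beta * L_style
print(L_total)
```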
Practice
Materials

- Content image
- Style image
The style image is taken from *The Castle* by the Impressionist master Monet.
Implementation
- Set some default parameters (the imports cover everything used in the snippets below):

```python
import numpy as np
import scipy.io
import scipy.misc
import tensorflow as tf
from keras.preprocessing.image import load_img, img_to_array

content_img_path = 'test.jpg'
style_img_path = 'style.jpg'
width, height = load_img(content_img_path).size
alpha, beta = 5, 100  # content weight and style weight
# ImageNet mean pixel values (RGB), used to zero-center the images
mean_value = np.array([123.68, 116.779, 103.939]).reshape((1, 1, 1, 3))
```
- Load the VGG19 model. First download the pretrained weights:

```sh
wget http://www.vlfeat.org/matconvnet/models/beta16/imagenet-vgg-verydeep-19.mat
```

Then load the model, replacing every max-pooling layer with average pooling, which reportedly gives better results:

```python
def load_vgg_model(path):
    vgg = scipy.io.loadmat(path)
    vgg_layers = vgg['layers']

    def _weights(layer, expected_layer_name):
        # Fetch the pretrained kernel and bias for one conv layer
        # from the MatConvNet struct array
        W = vgg_layers[0][layer][0][0][0][0][0]
        b = vgg_layers[0][layer][0][0][0][0][1]
        return W, b

    def _conv2d_relu(prev_layer, layer, layer_name):
        # Convolution with frozen weights, followed by ReLU
        W, b = _weights(layer, layer_name)
        W = tf.constant(W)
        b = tf.constant(np.reshape(b, (b.size)))
        return tf.nn.relu(tf.nn.conv2d(prev_layer, filter=W,
                                       strides=[1, 1, 1, 1], padding='SAME') + b)

    def _avgpool(prev_layer):
        # Average pooling in place of VGG's original max pooling
        return tf.nn.avg_pool(prev_layer, ksize=[1, 2, 2, 1],
                              strides=[1, 2, 2, 1], padding='SAME')

    # Only the input is a Variable; the network weights are constants
    graph = {}
    graph['input'] = tf.Variable(np.zeros((1, height, width, 3)), dtype='float32')
    graph['conv1_1'] = _conv2d_relu(graph['input'], 0, 'conv1_1')
    graph['conv1_2'] = _conv2d_relu(graph['conv1_1'], 2, 'conv1_2')
    graph['avgpool1'] = _avgpool(graph['conv1_2'])
    graph['conv2_1'] = _conv2d_relu(graph['avgpool1'], 5, 'conv2_1')
    graph['conv2_2'] = _conv2d_relu(graph['conv2_1'], 7, 'conv2_2')
    graph['avgpool2'] = _avgpool(graph['conv2_2'])
    graph['conv3_1'] = _conv2d_relu(graph['avgpool2'], 10, 'conv3_1')
    graph['conv3_2'] = _conv2d_relu(graph['conv3_1'], 12, 'conv3_2')
    graph['conv3_3'] = _conv2d_relu(graph['conv3_2'], 14, 'conv3_3')
    graph['conv3_4'] = _conv2d_relu(graph['conv3_3'], 16, 'conv3_4')
    graph['avgpool3'] = _avgpool(graph['conv3_4'])
    graph['conv4_1'] = _conv2d_relu(graph['avgpool3'], 19, 'conv4_1')
    graph['conv4_2'] = _conv2d_relu(graph['conv4_1'], 21, 'conv4_2')
    graph['conv4_3'] = _conv2d_relu(graph['conv4_2'], 23, 'conv4_3')
    graph['conv4_4'] = _conv2d_relu(graph['conv4_3'], 25, 'conv4_4')
    graph['avgpool4'] = _avgpool(graph['conv4_4'])
    graph['conv5_1'] = _conv2d_relu(graph['avgpool4'], 28, 'conv5_1')
    graph['conv5_2'] = _conv2d_relu(graph['conv5_1'], 30, 'conv5_2')
    graph['conv5_3'] = _conv2d_relu(graph['conv5_2'], 32, 'conv5_3')
    graph['conv5_4'] = _conv2d_relu(graph['conv5_3'], 34, 'conv5_4')
    graph['avgpool5'] = _avgpool(graph['conv5_4'])
    return graph
```
- Read and save images:

```python
def preprocess_image(image_path):
    # Load, resize, add a batch dimension, and zero-center with the ImageNet mean
    img = load_img(image_path, target_size=(height, width))
    img = img_to_array(img)
    img = np.expand_dims(img, axis=0)
    img = img - mean_value
    return img

def save_image(path, image):
    # Undo the zero-centering, then clip to valid pixel values
    image = image + mean_value
    image = image[0]
    image = np.clip(image, 0, 255).astype('uint8')
    scipy.misc.imsave(path, image)
```
- Generate the noise input image:

```python
def generate_noise_image(content_image, noise_ratio=0.75):
    # Start from random noise blended with a fraction of the content image
    noise_image = np.random.uniform(-20, 20, (1, height, width, 3)).astype('float32')
    input_image = noise_image * noise_ratio + content_image * (1 - noise_ratio)
    return input_image
```
- Content loss:

```python
def content_loss(sess, model):
    # Symbolic activations of whatever is currently assigned to the input...
    init = model['conv5_2']
    # ...and the fixed target activations, evaluated while the content image is assigned
    aim = sess.run(model['conv5_2'])
    loss = (1 / 2) * tf.reduce_sum(tf.pow(init - aim, 2))
    return loss
```
- Style loss:

```python
def gram_matrix(F, N, M):
    # Flatten to (M, N): M spatial positions, N feature maps,
    # then take the inner product of the matrix with its own transpose
    Ft = tf.reshape(F, (M, N))
    return tf.matmul(tf.transpose(Ft), Ft)

def get_style_loss(a, x):
    # a: fixed target activations (numpy); x: symbolic activations of the input
    N = a.shape[3]               # number of feature maps
    M = a.shape[1] * a.shape[2]  # size of each feature map
    A = gram_matrix(a, N, M)
    G = gram_matrix(x, N, M)
    return (1 / (4 * N ** 2 * M ** 2)) * tf.reduce_sum(tf.pow(G - A, 2))

def style_loss(sess, model):
    # Weighted sum of E_l over the five style layers,
    # with weights increasing toward the deeper layers
    loss = 0
    for layer, w in [('conv1_1', 0.5), ('conv2_1', 1.0), ('conv3_1', 1.5),
                     ('conv4_1', 3.0), ('conv5_1', 4.0)]:
        target = sess.run(model[layer])  # evaluated while the style image is assigned
        loss += w * get_style_loss(target, model[layer])
    return loss
```
- Training:

```python
with tf.Session() as sess:
    content_img = preprocess_image(content_img_path)
    style_img = preprocess_image(style_img_path)
    model = load_vgg_model('imagenet-vgg-verydeep-19.mat')
    input_img = generate_noise_image(content_img)
    sess.run(tf.global_variables_initializer())

    # Build the losses: assign each reference image to the input in turn
    # so the loss functions can capture its activations as fixed targets
    sess.run(model['input'].assign(content_img))
    c_loss = content_loss(sess, model)
    sess.run(model['input'].assign(style_img))
    s_loss = style_loss(sess, model)
    loss = alpha * c_loss + beta * s_loss

    optimizer = tf.train.AdamOptimizer(2.0)
    train = optimizer.minimize(loss)  # the only trainable variable is the input image

    # Re-initialize (this also creates Adam's slot variables), then start from noise
    sess.run(tf.global_variables_initializer())
    sess.run(model['input'].assign(input_img))
    for i in range(3000):
        sess.run(train)
        if i % 100 == 0:
            output_img = sess.run(model['input'])
            print('Iteration %d' % i)
            print('Cost: ', sess.run(loss))
            save_image('Iteration' + str(i) + '.jpg', output_img)
```
Final result:
Comparison:
Postscript
Summary
- The method works well, but it demands substantial compute: every transfer is a full optimization run, and the style is never distilled into a reusable model.
- Low-level style (pixel-statistics) features can be separated from high-level content features and handled independently.
- What, in the end, is artistic style? A blend of color, brushstroke, and line, or the spiritual refuge of the painter behind them? From the modern computer's point of view it is merely a fusion of features; yet real art remains the impassioned expression of each era's artists, a cry of the spirit swept along by the tides of their time. I hope that amid the trivialities of daily life I will not forget the poetry of "snow is coming with the evening sky; will you share a cup of wine with me?", for that is the life I set out to pursue.