GANs are generative models that learn a mapping from a random noise vector $z$ to an output image $y$, $G : z \rightarrow y$ [22]. In contrast, conditional GANs learn a mapping from an observed image $x$ and a random noise vector $z$ to $y$, $G : \{x, z\} \rightarrow y$.
The generator $G$ is trained to produce outputs that cannot be distinguished from "real" images by an adversarially trained discriminator $D$, which in turn is trained to do as well as possible at detecting the generator's "fakes". This training procedure is diagrammed in Figure 2.
Figure 2: Training a conditional GAN to map edges→photo. The discriminator $D$ learns to classify between fake (generator-synthesized) and real {edge map, photo} tuples. The generator $G$ learns to fool the discriminator. Unlike an unconditional GAN, both the generator and the discriminator observe the input edge map.
3.1. Objective
The objective of a conditional GAN can be expressed as
$$L_{cGAN}(G, D) = E_{x,y}[\log D(x, y)] + E_{x,z}[\log(1 - D(x, G(x, z)))] \tag{1}$$
where the generator $G$ tries to minimize this objective against an adversarial discriminator $D$ that tries to maximize it, i.e., $G^* = \arg\min_G \max_D L_{cGAN}(G, D)$.
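As a concrete illustration, the following is a minimal PyTorch-style sketch of this objective. It is not the authors' released implementation; the generator `G` and discriminator `D` are assumed to be modules that both take the conditioning image as an argument, and the generator term uses the common non-saturating form.

```python
import torch
import torch.nn.functional as F

def cgan_losses(G, D, x, y, z):
    """Sketch of the conditional GAN objective in Eqn. (1).

    Assumptions: G(x, z) returns a fake image, D(x, img) returns
    per-sample real/fake logits.
    """
    fake = G(x, z)

    # Discriminator: maximize log D(x, y) + log(1 - D(x, G(x, z)))
    d_real = D(x, y)
    d_fake = D(x, fake.detach())  # detach so D's update does not reach G
    loss_D = (
        F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real))
        + F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake))
    )

    # Generator: instead of minimizing log(1 - D(x, G(x, z))) directly,
    # maximize log D(x, G(x, z)) (the usual non-saturating variant).
    d_fake_for_g = D(x, fake)
    loss_G = F.binary_cross_entropy_with_logits(
        d_fake_for_g, torch.ones_like(d_fake_for_g)
    )
    return loss_D, loss_G
```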
To test the importance of conditioning the discriminator, we also compare against an unconditional variant in which the discriminator does not observe $x$:
$$L_{GAN}(G, D) = E_{y}[\log D(y)] + E_{x,z}[\log(1 - D(G(x, z)))] \tag{2}$$
We also explore mixing the GAN objective with a more traditional loss, using L1 distance since it encourages less blurring:
$$L_{L1}(G) = E_{x,y,z}[\,\|y - G(x, z)\|_1\,] \tag{3}$$
Our final objective is
G ∗ = arg min G max D ( L c G A N ( G , D ) + λ L L 1 ( G ) ) (4) G^* = \arg\min_G \max_D (L_{cGAN}(G, D) + \lambda L_{L1}(G)) \tag{4} G∗=argGminDmax(LcGAN(G,D)+λLL1(G))(4)
Without $z$, the net could still learn a mapping from $x$ to $y$, but would produce deterministic outputs and therefore fail to match any distribution other than a delta function. Past conditional GANs have acknowledged this and provided Gaussian noise $z$ as an input to the generator, in addition to $x$ (e.g., [52]). In initial experiments, we did not find this strategy effective: the generator simply learned to ignore the noise, which is consistent with Mathieu et al. [37]. Instead, for our final models, we provide noise only in the form of dropout, applied on several layers of our generator at both training and test time. Despite the dropout noise, we observe only minor stochasticity in the output of our nets. Designing conditional GANs that produce highly stochastic output, and thereby capture the full entropy of the conditional distributions they model, is an important question left open by the present work.
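One way to realize "dropout at both training and test time" in a PyTorch-style implementation is to keep the dropout layers stochastic during inference while the rest of the network stays in evaluation mode. The sketch below is an assumption about how this could be wired up, not the authors' code.

```python
import torch.nn as nn

def enable_test_time_dropout(model: nn.Module):
    """Keep Dropout layers stochastic at inference, so the generator's
    only source of noise (dropout) remains active when sampling outputs."""
    model.eval()  # freeze BatchNorm statistics, etc.
    for m in model.modules():
        if isinstance(m, (nn.Dropout, nn.Dropout2d)):
            m.train()  # re-enable stochastic dropout for this layer
```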
To give the generator a means to circumvent the bottleneck for information of this kind, we add skip connections, following the general shape of a "U-Net" [47]. Specifically, we add skip connections between each layer $i$ and layer $n - i$, where $n$ is the total number of layers. Each skip connection simply concatenates all channels at layer $i$ with those at layer $n - i$.
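These skip connections amount to concatenating encoder activations with the mirrored decoder activations along the channel dimension. The following is a toy two-level sketch of one such connection; the layer sizes and module names are illustrative and much smaller than the full pix2pix generator.

```python
import torch
import torch.nn as nn

class UNetSkipExample(nn.Module):
    """Toy encoder-decoder showing the channel-wise concatenation
    between layer i and layer n - i (not the full pix2pix generator)."""
    def __init__(self):
        super().__init__()
        self.enc1 = nn.Conv2d(3, 64, 4, stride=2, padding=1)    # layer i
        self.enc2 = nn.Conv2d(64, 128, 4, stride=2, padding=1)  # bottleneck
        self.dec2 = nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1)
        # layer n - i receives enc1's channels concatenated with dec2's
        self.dec1 = nn.ConvTranspose2d(64 + 64, 3, 4, stride=2, padding=1)

    def forward(self, x):
        e1 = torch.relu(self.enc1(x))
        e2 = torch.relu(self.enc2(e1))
        d2 = torch.relu(self.dec2(e2))
        # skip connection: concatenate layer i with layer n - i
        d1 = self.dec1(torch.cat([d2, e1], dim=1))
        return torch.tanh(d1)
```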
This motivates restricting the GAN discriminator to model only high-frequency structure, relying on the L1 term to enforce low-frequency correctness (Eqn. 4). To model high frequencies, it is sufficient to restrict attention to the structure in local image patches. We therefore design a discriminator architecture, which we term a PatchGAN, that only penalizes structure at the scale of patches. This discriminator tries to classify whether each $N \times N$ patch in an image is real or fake. We run this discriminator convolutionally across the image, averaging all responses to provide the ultimate output of $D$.
In Section 4.4, we demonstrate that $N$ can be much smaller than the full size of the image and still produce high-quality results. This is advantageous because a smaller PatchGAN has fewer parameters, runs faster, and can be applied to arbitrarily large images.
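A hedged sketch of a patch discriminator in this spirit follows: stacking strided convolutions gives each output logit a limited receptive field (an $N \times N$ patch), and the responses are averaged into a single score. The layer counts and channel widths below are illustrative and are not necessarily the paper's 70×70 configuration.

```python
import torch
import torch.nn as nn

class PatchDiscriminator(nn.Module):
    """Patch-based discriminator: each spatial output logit classifies an
    N x N receptive field of the (input, output) image pair as real/fake."""
    def __init__(self, in_channels=6):  # conditional: x and y concatenated
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 64, 4, stride=2, padding=1),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(64, 128, 4, stride=2, padding=1),
            nn.BatchNorm2d(128),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(128, 256, 4, stride=2, padding=1),
            nn.BatchNorm2d(256),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(256, 1, 4, stride=1, padding=1),  # one logit per patch
        )

    def forward(self, x, y):
        logits = self.net(torch.cat([x, y], dim=1))  # grid of patch logits
        return logits.mean(dim=[1, 2, 3])            # average responses -> D's output
```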
Since the initial release of the paper and our pix2pix codebase, the Twitter community, including computer vision and graphics practitioners as well as artists, has applied our framework to a wide variety of novel image-to-image translation tasks, far beyond the scope of the original paper. Figure 10 shows just a few examples from the #pix2pix hashtag, such as sketch→portrait, "Do as I Do" pose transfer, depth→street view, background removal, palette generation, sketch→Pokemon, and the wildly popular #edges2cats.
A. Buades, B. Coll, and J.-M. Morel. A non-local algorithm for image denoising. In CVPR, volume 2, pages 60--65. IEEE, 2005.
L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Semantic image segmentation with deep convolutional nets and fully connected CRFs. In ICLR, 2015.
T. Chen, M.-M. Cheng, P. Tan, A. Shamir, and S.-M. Hu. Sketch2photo: internet image montage. ACM Transactions on Graphics (TOG), 28(5):124, 2009.
M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The Cityscapes Dataset for semantic urban scene understanding. In CVPR, 2016.
E. L. Denton, S. Chintala, R. Fergus, et al. Deep generative image models using a Laplacian pyramid of adversarial networks. In NIPS, pages 1486--1494, 2015.
A. Dosovitskiy and T. Brox. Generating images with perceptual similarity metrics based on deep networks. arXiv preprint arXiv:1602.02644, 2016.
A. A. Efros and W. T. Freeman. Image quilting for texture synthesis and transfer. In SIGGRAPH, pages 341--346. ACM, 2001.
A. A. Efros and T. K. Leung. Texture synthesis by non-parametric sampling. In ICCV, volume 2, pages 1033--1038. IEEE, 1999.
D. Eigen and R. Fergus. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In Proceedings of the IEEE International Conference on Computer Vision, pages 2650--2658, 2015.
M. Eitz, J. Hays, and M. Alexa. How do humans sketch objects? SIGGRAPH, 31(4):44--1, 2012.
R. Fergus, B. Singh, A. Hertzmann, S. T. Roweis, and W. T. Freeman. Removing camera shake from a single photograph. In ACM Transactions on Graphics (TOG), volume 25, pages 787--794. ACM, 2006.
L. A. Gatys, A. S. Ecker, and M. Bethge. Texture synthesis and the controlled generation of natural stimuli using convolutional neural networks. arXiv preprint arXiv:1505.07376, 2015.
L. A. Gatys, A. S. Ecker, and M. Bethge. Image style transfer using convolutional neural networks. CVPR, 2016.
J. Gauthier. Conditional generative adversarial nets for convolutional face generation. Class Project for Stanford CS231N: Convolutional Neural Networks for Visual Recognition, Winter semester, 2014(5):2, 2014.
I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, 2014.
A. Hertzmann, C. E. Jacobs, N. Oliver, B. Curless, and D. H. Salesin. Image analogies. In SIGGRAPH, pages 327--340. ACM, 2001.
G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504--507, 2006.
S. Iizuka, E. Simo-Serra, and H. Ishikawa. Let there be Color!: Joint End-to-end Learning of Global and Local Image Priors for Automatic Image Colorization with Simultaneous Classification. ACM Transactions on Graphics (TOG), 35(4), 2016.
S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. 2015.
J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. 2016.
L. Karacan, Z. Akata, A. Erdem, and E. Erdem. Learning to generate images of outdoor scenes from attributes and semantic layouts. arXiv preprint arXiv:1612.00215, 2016.
D. Kingma and J. Ba. Adam: A method for stochastic optimization. ICLR, 2015.
P.-Y. Laffont, Z. Ren, X. Tao, C. Qian, and J. Hays. Transient attributes for high-level understanding and editing of outdoor scenes. ACM Transactions on Graphics (TOG), 33(4):149, 2014.
A. B. L. Larsen, S. K. Sønderby, and O. Winther. Autoencoding beyond pixels using a learned similarity metric. arXiv preprint arXiv:1512.09300, 2015.
G. Larsson, M. Maire, and G. Shakhnarovich. Learning representations for automatic colorization. ECCV, 2016.
C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, et al. Photo-realistic single image super-resolution using a generative adversarial network. arXiv preprint arXiv:1609.04802, 2016.
C. Li and M. Wand. Combining Markov random fields and convolutional neural networks for image synthesis. CVPR, 2016.
C. Li and M. Wand. Precomputed real-time texture synthesis with Markovian generative adversarial networks. ECCV, 2016.
J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, pages 3431--3440, 2015.
M. Mathieu, C. Couprie, and Y. LeCun. Deep multi-scale video prediction beyond mean square error. ICLR, 2016.
M. Mirza and S. Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.
A. Owens, P. Isola, J. McDermott, A. Torralba, E. H. Adelson, and W. T. Freeman. Visually indicated sounds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2405--2413, 2016.
D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros. Context encoders: Feature learning by inpainting. CVPR, 2016.
A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
R. Tyleček and R. Šára. Spatial pattern templates for recognition of objects with regular structure. In Proc. GCPR, Saarbrücken, Germany, 2013.
S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee. Generative adversarial text to image synthesis. arXiv preprint arXiv:1605.05396, 2016.
S. Reed, A. van den Oord, N. Kalchbrenner, V. Bapst, M. Botvinick, and N. de Freitas. Generating interpretable images with controllable structure. Technical report, Technical report, 2016.
S. E. Reed, Z. Akata, S. Mohan, S. Tenka, B. Schiele, and H. Lee. Learning what and where to draw. In Advances In Neural Information Processing Systems, pages 217--225, 2016.
E. Reinhard, M. Ashikhmin, B. Gooch, and P. Shirley. Color transfer between images. IEEE Computer Graphics and Applications, 21:34--41, 2001.
O. Ronneberger, P. Fischer, and T. Brox. U-net: Convolutional networks for biomedical image segmentation. In MICCAI, pages 234--241. Springer, 2015.
O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. Imagenet large scale visual recognition challenge. IJCV, 115(3):211--252, 2015.
T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training GANs. arXiv preprint arXiv:1606.03498, 2016.
Y. Shih, S. Paris, F. Durand, and W. T. Freeman. Data-driven hallucination of different times of day from a single outdoor photo. ACM Transactions on Graphics (TOG), 32(6):200, 2013.
D. Ulyanov, A. Vedaldi, and V. Lempitsky. Instance normalization: The missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022, 2016.
X. Wang and A. Gupta. Generative image modeling using style and structure adversarial networks. ECCV, 2016.
Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600--612, 2004.
S. Xie, X. Huang, and Z. Tu. Top-down learning for structured labeling with convolutional pseudoprior. 2015.
S. Xie and Z. Tu. Holistically-nested edge detection. In ICCV, 2015.
D. Yoo, N. Kim, S. Park, A. S. Paek, and I. S. Kweon. Pixel-level domain transfer. ECCV, 2016.
A. Yu and K. Grauman. Fine-Grained Visual Comparisons with Local Learning. In CVPR, 2014.
R. Zhang, P. Isola, and A. A. Efros. Colorful image colorization. ECCV, 2016.
J. Zhao, M. Mathieu, and Y. LeCun. Energy-based generative adversarial network. arXiv preprint arXiv:1609.03126, 2016.
Y. Zhou and T. L. Berg. Learning temporal transformations from time-lapse videos. In ECCV, 2016.
J.-Y. Zhu, P. Krähenbühl, E. Shechtman, and A. A. Efros. Generative visual manipulation on the natural image manifold. In ECCV, 2016.