Frontiers of Computer Science

Front. Comput. Sci.    2024, Vol. 18 Issue (1) : 181304    https://doi.org/10.1007/s11704-022-2385-x
Artificial Intelligence
CRD-CGAN: category-consistent and relativistic constraints for diverse text-to-image generation
Tao HU1,2,3, Chengjiang LONG4, Chunxia XIAO2
1. College of Intelligent Systems Science and Engineering, Hubei Minzu University, Enshi 445000, China
2. School of Computer Science, Wuhan University, Wuhan 430072, China
3. Key Laboratory of Performing Art Equipment & System Technology, Ministry of Culture and Tourism, Beijing 100007, China
4. Meta Reality Labs, Burlingame, CA, 94010, USA
Abstract

Generating photo-realistic images from a text description is a challenging problem in computer vision. Previous works have shown promising performance in generating synthetic images conditioned on text with Generative Adversarial Networks (GANs). In this paper, we focus on category-consistent and relativistic diverse constraints to optimize the diversity of the synthetic images. Based on these constraints, we propose a category-consistent and relativistic diverse conditional GAN (CRD-CGAN) to synthesize K photo-realistic images simultaneously. We use an attention loss and a diversity loss to improve the sensitivity of the GAN to word attention and noise. We then employ a relativistic conditional loss to estimate the probability that a synthetic image is relatively real or fake, which improves upon the basic conditional loss. Finally, we introduce a category-consistent loss to alleviate the over-category issue among the K synthetic images. We evaluate our approach on the Caltech-UCSD Birds-200-2011, Oxford 102 flower, and MS COCO 2014 datasets, and extensive experiments demonstrate the superiority of the proposed method over state-of-the-art methods in terms of the photorealism and diversity of the generated synthetic images.
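The relativistic conditional loss described above follows the relativistic average formulation introduced by Jolicoeur-Martineau: the discriminator is trained to judge whether a real image looks more realistic than the average synthetic one, rather than classifying each image in isolation. Below is a minimal PyTorch sketch of the discriminator side under that assumption; the paper's exact conditional variant may differ, and the names c_real/c_fake are ours:

```python
import torch
import torch.nn.functional as F

def relativistic_avg_d_loss(c_real, c_fake):
    # c_real, c_fake: raw (pre-sigmoid) critic outputs for a batch of real
    # and synthetic images under the same text condition.
    real_logits = c_real - c_fake.mean()  # real should beat the average fake
    fake_logits = c_fake - c_real.mean()  # fake should lose to the average real
    loss_real = F.binary_cross_entropy_with_logits(
        real_logits, torch.ones_like(real_logits))
    loss_fake = F.binary_cross_entropy_with_logits(
        fake_logits, torch.zeros_like(fake_logits))
    return loss_real + loss_fake
```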

Keywords: text-to-image; diverse conditional GAN; relativistic category-consistent
Corresponding Author(s): Chunxia XIAO   
Just Accepted Date: 28 November 2022   Issue Date: 14 March 2023
 Cite this article:   
Tao HU, Chengjiang LONG, Chunxia XIAO. CRD-CGAN: category-consistent and relativistic constraints for diverse text-to-image generation[J]. Front. Comput. Sci., 2024, 18(1): 181304.
 URL:  
https://academic.hep.com.cn/fcs/EN/10.1007/s11704-022-2385-x
https://academic.hep.com.cn/fcs/EN/Y2024/V18/I1/181304
Fig.1  Illustration of three methods for generating synthetic images conditioned on a text description. Our goal is to generate a set of diverse and high-quality synthetic images that are as consistent as possible with the text description and with the visual features of the real image's category
Fig.2  Overview of our proposed CRD-CGAN framework. We use the generators Gi1,...,GiK to generate K synthetic images. Based on the difference between the image space and the latent noise space, we apply the diverse constraints to improve the diversity of the K synthetic images. The proposed category-consistent and relativistic constraints are applied in the discriminator at each stage. In this figure, the green arrow denotes the relativistic loss between a synthetic image and the corresponding real images under the true label, and the blue arrow denotes the relativistic loss between a real image and the corresponding synthetic images under the fake label. Among the four groups of images (inner circle: real images; outer ring: the corresponding synthetic images), the three groups above the decision boundary have better category consistency
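The "diverse constraints" in Fig.2 compare distances in image space against distances in latent noise space. A common way to realize this is the mode-seeking regularization of MSGAN, sketched below under that assumption; the paper's exact distance functions may differ:

```python
import torch

def diversity_loss(img_a, img_b, z_a, z_b, eps=1e-5):
    # Ratio of image-space distance to noise-space distance. Maximizing this
    # ratio pushes two different noises z_a, z_b toward two visibly different
    # synthetic images, discouraging mode collapse.
    d_img = torch.mean(torch.abs(img_a - img_b))
    d_z = torch.mean(torch.abs(z_a - z_b))
    return d_img / (d_z + eps)
```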
Notation  | Meaning in CRD-CGAN
i         | Stage index of the tree-like stacked GANs
c         | Text condition parameter
zk        | The k-th noise vector
sk        | The k-th synthetic image
Xi        | Real image drawn from the distribution Pdata at stage i
Xr        | Real image
Xf        | Fake (synthetic) image
GiK       | The K generators at stage i
Di        | The discriminator at stage i
lf, lr    | The symmetric labels {−1, 1}
categoryi | The true category of Xi
Tab.1  The parameters in CRD-CGAN
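Under this notation, the baseline adversarial game at stage i, before the additional constraints are imposed, is the standard conditional GAN objective (a sketch; the paper's full objective adds the attention, diversity, relativistic, and category-consistency terms described in the abstract):

$$\min_{G_i}\max_{D_i}\; \mathbb{E}_{X_i\sim P_{data}}\big[\log D_i(X_i,c)\big] + \mathbb{E}_{z_k\sim P_z}\big[\log\big(1-D_i(G_i(z_k,c),c)\big)\big]$$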
Fig.3  Illustration of four kinds of discrimination regularization. (a) is the basic discrimination regularization, which only discriminates whether the synthetic image is real. (b) is the proposed relativistic discrimination regularization, which adds the relativistic average conditional loss to (a). (c) is the proposed category-consistent discrimination regularization, which combines the category consistency loss with (a). (d) is the proposed category-consistent and relativistic discrimination regularization, which introduces both the category consistency loss and the relativistic average conditional loss on top of (a). The green arrow is the relativistic loss from a synthetic image to the corresponding real images under the true label; the blue arrow is the relativistic loss from a real image to the corresponding synthetic images under the fake label
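One plausible reading of the category consistency loss in (c) and (d), in the style of an auxiliary classifier (ACGAN), is that each of the K synthetic images should be classified into the category of the conditioning real image. The sketch below reflects that assumption; aux_classifier, fake_images, and category are hypothetical names, and the paper's exact formulation may differ:

```python
import torch.nn.functional as F

def category_consistency_loss(aux_classifier, fake_images, category):
    # fake_images: list of K image batches for one caption; category: the
    # ground-truth labels of the corresponding real images (LongTensor).
    loss = sum(F.cross_entropy(aux_classifier(img), category)
               for img in fake_images)
    return loss / len(fake_images)
```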
Methods    | FID         | LPIPS       | User study
StackGAN++ | 27.90±0.02  | 31.37%±3.17 | 12.17%±1.05
MSGAN      | 27.48±0.38  | 36.87%±0.68 | 15.07%±5.62
AttnGAN    | 23.81±0.53  | 35.26%±0.05 | 12.38%±3.07
D-CGAN-S   | 26.41±0.48  | 37.12%±1.97 | 19.72%±3.20
D-CGAN-A   | 22.61±0.13  | 38.67%±0.49 | 26.49%±4.31
RD-CGAN    | 26.53±0.253 | 39.07%±0.31 | 24.73%±4.38
CD-CGAN    | 28.25±0.16  | 39.12%±0.28 | 31.67%±2.93
CRD-CGAN   | 24.59±0.35  | 39.91%±0.34 | 33.24%±5.47
Tab.2  Diversity performance comparison on the Caltech-UCSD Birds-200-2011 dataset
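The LPIPS column measures perceptual diversity among the K images generated for one caption; higher is more diverse. One standard way to compute it, using the public lpips package, is to average LPIPS over all image pairs, as sketched below (whether the authors average over exactly these pairs is an assumption):

```python
import itertools
import torch
import lpips  # pip install lpips

loss_fn = lpips.LPIPS(net='alex')  # AlexNet backbone, the common choice

def lpips_diversity(images):
    # images: K tensors of shape (1, 3, H, W), scaled to [-1, 1] as the
    # lpips package expects. Returns the average pairwise distance.
    with torch.no_grad():
        scores = [loss_fn(a, b).item()
                  for a, b in itertools.combinations(images, 2)]
    return sum(scores) / len(scores)
```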
Fig.4  Visualization of K=5 high-resolution and photo-realistic synthetic images conditioned on a text description, with comparison to state-of-the-art methods (top) on the Caltech-UCSD Birds-200-2011 dataset
Methods    | FID        | LPIPS       | User study
StackGAN++ | 64.13±0.88 | 23.47%±1.63 | 7.47%±0.92
MSGAN      | 61.95±0.23 | 32.09%±0.29 | 16.07%±4.31
AttnGAN    | 42.41±0.19 | 33.04%±0.84 | 16.53%±1.90
D-CGAN-S   | 45.03±1.07 | 33.50%±0.13 | 19.36%±1.41
D-CGAN-A   | 33.11±0.11 | 33.16%±0.81 | 22.51%±4.56
RD-CGAN    | 42.76±0.23 | 33.31%±0.80 | 23.69%±3.57
CD-CGAN    | 43.71±0.19 | 34.82%±0.79 | 30.11%±3.14
CRD-CGAN   | 40.75±0.32 | 37.56%±0.15 | 37.38%±3.07
Tab.3  Diversity performance comparison on the Oxford 102 flower dataset
Fig.5  Visualization of K=5 high-resolution and photo-realistic synthetic images conditioned on a text description, compared with the corresponding real images (top) on the Oxford 102 flower dataset
Fig.6  K=3 synthetic images generated conditioned on the text “A small colorful bird with a white belly, and a black chest, head, and tail.” with the top-5 word attention maps. The results generated by AttnGAN are shown in the blue rectangle, and the results generated by CRD-CGAN in the red rectangle
Methods    | CUB       | Oxford
StackGAN++ | 4.02±0.58 | 2.49±0.02
MSGAN      | 4.28±0.05 | 3.25±0.30
AttnGAN    | 4.31±0.68 | 3.36±0.02
HDGAN      | 4.15±0.05 | 3.45±0.07
CTGAN      | 4.23±0.05 | 3.71±0.06
D-CGAN-S   | 4.29±0.07 | 3.29±0.08
D-CGAN-A   | 4.51±0.04 | 3.39±0.02
RD-CGAN    | 4.54±0.06 | 3.48±0.03
CD-CGAN    | 4.84±0.11 | 3.50±0.02
CRD-CGAN   | 4.75±0.10 | 3.53±0.06
Tab.4  Inception score comparison on the CUB and Oxford datasets
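For reference, the Inception score reported in Tab.4 is the standard metric of Salimans et al., computed from the Inception-v3 class posterior p(y|x) over generated images x ~ p_g; confident per-image predictions combined with a diverse marginal class distribution give a higher score:

$$\mathrm{IS} = \exp\Big(\mathbb{E}_{x\sim p_g}\, D_{KL}\big(p(y\mid x)\,\|\,p(y)\big)\Big)$$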
Methods    | CUB        | Oxford
StackGAN++ | 10.57±4.83 | 13.66±1.44
MSGAN      | 16.08±5.12 | 18.67±1.73
AttnGAN    | 67.82±4.43 | 45.50±1.25
HDGAN      | 68.59±1.33 | 44.46±1.54
CTGAN      | 69.07±1.50 | 45.99±1.62
D-CGAN-S   | 67.33±4.85 | 20.13±0.98
D-CGAN-A   | 68.96±3.17 | 46.54±1.56
RD-CGAN    | 70.41±3.28 | 56.88±2.72
CD-CGAN    | 70.62±2.92 | 47.12±1.94
CRD-CGAN   | 71.17±2.36 | 47.70±2.22
Tab.5  R-Precision score comparison on the CUB and Oxford datasets
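R-Precision here follows the retrieval protocol popularized by AttnGAN: for each generated image, the ground-truth caption is ranked against randomly mismatched captions by cosine similarity of their embeddings, and a hit is counted when it ranks first. A sketch under that assumption (99 distractors, R=1; all names are ours):

```python
import torch
import torch.nn.functional as F

def r_precision_hit(img_feat, true_text_feat, mismatched_feats):
    # img_feat: (D,) feature of one generated image; true_text_feat: (D,)
    # embedding of its caption; mismatched_feats: 99 embeddings of random
    # other captions. Returns 1.0 when the true caption ranks first.
    cands = torch.stack([true_text_feat] + list(mismatched_feats))  # (100, D)
    sims = F.cosine_similarity(img_feat.unsqueeze(0), cands, dim=1)
    return float(sims.argmax().item() == 0)
```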
Methods   | FID        | LPIPS       | User study
AttnGAN   | 42.16±0.01 | 42.06%±0.21 | 12.41%±0.70
CPGAN     | 55.82±0.52 | –           | 16.24%±0.50
PPGAN     | 43.77±0.13 | –           | 14.86%±0.34
D-CGAN-A  | 39.35±0.02 | 41.49%±0.23 | 16.05%±0.23
RD-CGAN   | 38.61±0.10 | 42.16%±0.14 | 14.64%±0.47
CD-CGAN   | 43.31±0.02 | 42.18%±0.43 | 12.85%±0.49
CRD-CGAN  | 41.79±0.07 | 42.52%±0.46 | 19.45%±0.11
Tab.6  Diversity performance comparison on the MS COCO 2014 dataset
Fig.7  Visualization of K=5 high-resolution and photo-realistic synthetic images conditioned on a text description, compared with the corresponding real images (top) on the MS COCO 2014 dataset
Fig.8  Examples of CRD-CGAN's ability to capture word changes (underlined words in red) in the text description on the Caltech-UCSD Birds-200-2011 dataset (top), the Oxford 102 flower dataset (middle), and the MS COCO 2014 dataset (bottom)