1. Institute of Computer Science and Technology, Peking University, Beijing 100871, China 2. Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China 3. Institute of Information Science, Beijing Jiaotong University, Beijing 100044, China 4. National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China 5. Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China 6. Department of Computer Science and Technology, Xi’an Jiaotong University, Xi’an 710049, China 7. School of Electronics Engineering and Computer Science, Peking University, Beijing 100871, China
Cross-media analysis and reasoning is an active research area in computer science, and a promising direction for artificial intelligence. However, to the best of our knowledge, no existing work has summarized the state-of-the-art methods for cross-media analysis and reasoning or presented advances, challenges, and future directions for the field. To address these issues, we provide an overview as follows: (1) theory and model for cross-media uniform representation; (2) cross-media correlation understanding and deep mining; (3) cross-media knowledge graph construction and learning methodologies; (4) cross-media knowledge evolution and reasoning; (5) cross-media description and generation; (6) cross-media intelligent engines; and (7) cross-media intelligent applications. By presenting approaches, advances, and future directions in cross-media analysis and reasoning, our goal is not only to draw more attention to the state-of-the-art advances in the field, but also to provide technical insights by discussing the challenges and research directions in these areas.
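The classical starting point for topic (1), cross-media uniform representation, is canonical correlation analysis (Hotelling, 1936; see also Andrew et al., 2013 and Rasiwasia et al., 2014 in the references below), which projects features from two modalities into a shared space where their correlation is maximized. The following NumPy sketch is illustrative only: the function name, regularization constant, and feature dimensions are our own choices, not part of any cited work.

```python
import numpy as np

def cca(X, Y, n_components=2, reg=1e-6):
    """Classical canonical correlation analysis (Hotelling, 1936).

    X, Y: (n_samples, d1) and (n_samples, d2) feature matrices for two
    modalities (e.g. image and text features of the same documents).
    Returns projection matrices Wx, Wy and the canonical correlations.
    """
    # Center each view.
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    # Regularized covariance and cross-covariance matrices.
    Sxx = X.T @ X / n + reg * np.eye(X.shape[1])
    Syy = Y.T @ Y / n + reg * np.eye(Y.shape[1])
    Sxy = X.T @ Y / n

    def inv_sqrt(S):
        # Inverse matrix square root via eigendecomposition
        # (S is symmetric positive definite thanks to `reg`).
        vals, vecs = np.linalg.eigh(S)
        return vecs @ np.diag(vals ** -0.5) @ vecs.T

    Kx, Ky = inv_sqrt(Sxx), inv_sqrt(Syy)
    # Singular values of the whitened cross-covariance are the
    # canonical correlations; singular vectors give the directions.
    U, corrs, Vt = np.linalg.svd(Kx @ Sxy @ Ky)
    Wx = Kx @ U[:, :n_components]
    Wy = Ky @ Vt.T[:, :n_components]
    return Wx, Wy, corrs[:n_components]
```

After projection, heterogeneous modalities live in one space, so cross-media retrieval reduces to nearest-neighbor search there; deep variants (e.g. deep CCA) replace the linear maps with neural networks.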
Aamodt, A., Plaza, E., 1994. Case-based reasoning: foundational issues, methodological variations, and system approaches. AI Commun., 7(1):39–59.
Adib, F., Hsu, C.Y., Mao, H., et al., 2015. Capturing the human figure through a wall. ACM Trans. Graph., 34(6):219.
Andrew, G., Arora, R., Bilmes, J., et al., 2013. Deep canonical correlation analysis. Int. Conf. on Machine Learning, p.1247–1255.
Antenucci, D., Li, E., Liu, S., et al., 2013. Ringtail: a generalized nowcasting system. Proc. VLDB Endow., 6(12):1358–1361.
Antol, S., Agrawal, A., Lu, J., et al., 2015. VQA: visual question answering. IEEE Int. Conf. on Computer Vision, p.2425–2433.
Babenko, A., Slesarev, A., Chigorin, A., et al., 2014. Neural codes for image retrieval. European Conf. on Computer Vision, p.584–599.
Brownson, R.C., Gurney, J.G., Land, G.H., 1999. Evidence-based decision making in public health. J. Publ. Health Manag. Pract., 5(5):86–97.
Carlson, A., Betteridge, J., Kisiel, B., et al., 2010. Toward an architecture for never-ending language learning. AAAI Conf. on Artificial Intelligence, p.1306–1313.
Chen, D.P., Weber, S.C., Constantinou, P.S., et al., 2007. Clinical arrays of laboratory measures, or “clinarrays”, built from an electronic health record enable disease subtyping by severity. AMIA Annual Symp. Proc., p.115–119.
Chen, X., Shrivastava, A., Gupta, A., 2013. NEIL: extracting visual knowledge from web data. IEEE Int. Conf. on Computer Vision, p.1409–1416.
Chen, Y., Carroll, R.J., Hinz, E.R.M., et al., 2013. Applying active learning to high-throughput phenotyping algorithms for electronic health records data. J. Am. Med. Inform. Assoc., 20(e2):253–259.
Cilibrasi, R.L., Vitanyi, P.M.B., 2007. The Google similarity distance. IEEE Trans. Knowl. Data Eng., 19(3):370–383.
Culotta, A., 2014. Estimating county health statistics with Twitter. ACM Conf. on Human Factors in Computing Systems, p.1335–1344.
Daras, P., Manolopoulou, S., Axenopoulos, A., 2012. Search and retrieval of rich media objects supporting multiple multimodal queries. IEEE Trans. Multim., 14(3):734–746.
Davenport, T.H., Prusak, L., 1998. Working Knowledge: How Organizations Manage What They Know. Harvard Business School Press, Boston, p.5.
Deng, J., Dong, W., Socher, R., et al., 2009. ImageNet: a large-scale hierarchical image database. IEEE Conf. on Computer Vision and Pattern Recognition, p.248–255.
Dong, X., Gabrilovich, E., Heitz, G., et al., 2014. Knowledge vault: a Web-scale approach to probabilistic knowledge fusion. ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, p.601–610.
Fang, Q., Xu, C., Sang, J., et al., 2016. Folksonomy-based visual ontology construction and its applications. IEEE Trans. Multim., 18(4):702–713.
Fellbaum, C., Miller, G., 1998. WordNet: an Electronic Lexical Database. MIT Press, Cambridge, MA.
Feng, F., Wang, X., Li, R., 2014. Cross-modal retrieval with correspondence autoencoder. ACM Int. Conf. on Multimedia, p.7–16.
Ferrucci, D., Levas, A., Bagchi, S., et al., 2013. Watson: beyond Jeopardy! Artif. Intell., 199-200:93–105.
Garfield, E., 2004. Historiographic mapping of knowledge domains literature. J. Inform. Sci., 30(2):119–145.
Gibney, E., 2015. DeepMind algorithm beats people at classic video games. Nature, 518(7540):465–466.
Ginsberg, J., Mohebbi, M., Patel, R.S., et al., 2009. Detecting influenza epidemics using search engine query data. Nature, 457(7232):1012–1014.
Gong, Y., Ke, Q., Isard, M., et al., 2014. A multi-view embedding space for modeling internet images, tags, and their semantics. Int. J. Comput. Vis., 106(2):210–233.
Hodosh, M., Young, P., Hockenmaier, J., 2013. Framing image description as a ranking task: data, models and evaluation metrics. J. Artif. Intell. Res., 47(1):853–899.
Hotelling, H., 1936. Relations between two sets of variates. Biometrika, 28(3-4):321–377.
Hsu, F., 2002. Behind Deep Blue: Building the Computer that Defeated the World Chess Champion. Princeton University Press, Princeton, USA.
Hua, Y., Wang, S., Liu, S., et al., 2014. TINA: cross-modal correlation learning by adaptive hierarchical semantic aggregation. IEEE Int. Conf. on Data Mining, p.190–199.
Jia, X., Gavves, E., Fernando, B., et al., 2015. Guiding long-short term memory for image caption generation. arXiv:1509.04942.
Johnson, J., Krishna, R., Stark, M., et al., 2015. Image retrieval using scene graphs. IEEE Conf. on Computer Vision and Pattern Recognition, p.3668–3678.
Karpathy, A., Li, F.F., 2015. Deep visual-semantic alignments for generating image descriptions. IEEE Conf. on Computer Vision and Pattern Recognition, p.3128–3137.
Krizhevsky, A., Sutskever, I., Hinton, G.E., 2012. ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, p.1097–1105.
Kulkarni, G., Premraj, V., Dhar, S., et al., 2011. Baby talk: understanding and generating simple image descriptions. IEEE Conf. on Computer Vision and Pattern Recognition, p.1601–1608.
Kumar, S., Sanderford, M., Gray, V.E., et al., 2012. Evolutionary diagnosis method for variants in personal exomes. Nat. Meth., 9(9):855–856.
Kuznetsova, P., Ordonez, V., Berg, T.L., et al., 2014. TREETALK: composition and compression of trees for image descriptions. Trans. Assoc. Comput. Ling., 2:351–362.
Lazaric, A., 2012. Transfer in reinforcement learning: a framework and a survey. In: Wiering, M., van Otterlo, M. (Eds.), Reinforcement Learning: State-of-the-Art. Springer Berlin Heidelberg, Berlin, p.143–173.
Lazer, D., Kennedy, R., King, G., et al., 2014. The parable of Google flu: traps in big data analysis. Science, 343(6176):1203–1205.
Lew, M.S., Sebe, N., Djeraba, C., et al., 2006. Content-based multimedia information retrieval: state of the art and challenges. ACM Trans. Multim. Comput. Commun. Appl., 2(1):1–19.
Lin, T., Pantel, P., Gamon, M., et al., 2012. Active objects: actions for entity-centric search. ACM Int. Conf. on World Wide Web, p.589–598.
Luo, G., Tang, C., 2008. On iterative intelligent medical search. ACM SIGIR Conf. on Research and Development in Information Retrieval, p.3–10.
Mao, X., Lin, B., Cai, D., et al., 2013. Parallel field alignment for cross media retrieval. ACM Int. Conf. on Multimedia, p.897–906.
Prabhu, N., Babu, R.V., 2015. Attribute-Graph: a graph based approach to image ranking. IEEE Int. Conf. on Computer Vision, p.1071–1079.
Radinsky, K., Davidovich, S., Markovitch, S., 2012. Learning causality for news events prediction. Int. Conf. on World Wide Web, p.909–918.
Rasiwasia, N., Costa Pereira, J., Coviello, E., et al., 2010. A new approach to cross-modal multimedia retrieval. ACM Int. Conf. on Multimedia, p.251–260.
Rasiwasia, N., Mahajan, D., Mahadevan, V., et al., 2014. Cluster canonical correlation analysis. Int. Conf. on Artificial Intelligence and Statistics, p.823–831.
Rautaray, S.S., Agrawal, A., 2015. Vision based hand gesture recognition for human computer interaction: a survey. Artif. Intell. Rev., 43(1):1–54.
Roller, S., Schulte im Walde, S., 2013. A multimodal LDA model integrating textual, cognitive and visual modalities. Conf. on Empirical Methods in Natural Language Processing, p.1146–1157.
Sadeghi, F., Divvala, S.K., Farhadi, A., 2015. VisKE: visual knowledge extraction and question answering by visual verification of relation phrases. IEEE Conf. on Computer Vision and Pattern Recognition, p.1456–1464.
Singhal, A., 2012. Introducing the knowledge graph: things, not strings. Official Blog of Google.
Socher, R., Lin, C., Ng, A.Y., et al., 2011. Parsing natural scenes and natural language with recursive neural networks. Int. Conf. on Machine Learning, p.129–136.
Socher, R., Karpathy, A., Le, Q., et al., 2014. Grounded compositional semantics for finding and describing images with sentences. Trans. Assoc. Comput. Ling., 2:207–218.
Srivastava, N., Salakhutdinov, R., 2012. Multimodal learning with deep Boltzmann machines. Advances in Neural Information Processing Systems, p.2222–2230.
Suchanek, F., Weikum, G., 2014. Knowledge bases in the age of big data analytics. Proc. VLDB Endow., 7(13):1713–1714.
Uyar, A., Aliyu, F.M., 2015. Evaluating search features of Google Knowledge Graph and Bing Satori: entity types, list searches and query interfaces. Onl. Inform. Rev., 39(2):197–213.
Vinyals, O., Toshev, A., Bengio, S., et al., 2015. Show and tell: a neural image caption generator. IEEE Conf. on Computer Vision and Pattern Recognition, p.3156–3164.
Wang, D., Cui, P., Ou, M., et al., 2015. Learning compact hash codes for multimodal representations using orthogonal deep structure. IEEE Trans. Multim., 17(9):1404–1416.
Wang, W., Ooi, B.C., Yang, X., et al., 2014. Effective multi-modal retrieval based on stacked auto-encoders. Proc. VLDB Endow., 7(8):649–660.
Wang, Y., Wu, F., Song, J., et al., 2014. Multi-modal mutual topic reinforce modeling for cross-media retrieval. ACM Int. Conf. on Multimedia, p.307–316.
Wei, Y., Zhao, Y., Lu, C., et al., 2017. Cross-modal retrieval with CNN visual features: a new baseline. IEEE Trans. Cybern., 47(2):449–460.
Wu, W., Xu, J., Li, H., 2010. Learning similarity function between objects in heterogeneous spaces. Technical Report MSR-TR-2010-86, Microsoft.
Xu, K., Ba, J., Kiros, R., et al., 2015. Show, attend and tell: neural image caption generation with visual attention. Int. Conf. on Machine Learning, p.2048–2057.
Yang, Y., Zhuang, Y., Wu, F., et al., 2008. Harmonizing hierarchical manifolds for multimedia document semantics understanding and cross-media retrieval. IEEE Trans. Multim., 10(3):437–446.
Yang, Y., Teo, C.L., Daume, H., et al., 2011. Corpus-guided sentence generation of natural images. Conf. on Empirical Methods in Natural Language Processing, p.444–454.
Yang, Y., Nie, F., Xu, D., et al., 2012. A multimedia retrieval framework based on semi-supervised ranking and relevance feedback. IEEE Trans. Patt. Anal. Mach. Intell., 34(4):723–742.
Yuan, L., Pan, C., Ji, S., et al., 2014. Automated annotation of developmental stages of Drosophila embryos in images containing spatial patterns of expression. Bioinformatics, 30(2):266–273.
Zhai, X., Peng, Y., Xiao, J., 2014. Learning cross-media joint representation with sparse and semi-supervised regularization. IEEE Trans. Circ. Syst. Video Technol., 24(6):965–978.
Zhang, H., Yang, Y., Luan, H., et al., 2014a. Start from scratch: towards automatically identifying, modeling, and naming visual attributes. ACM Int. Conf. on Multimedia, p.187–196.
Zhang, H., Yuan, J., Gao, X., et al., 2014b. Boosting cross-media retrieval via visual-auditory feature analysis and relevance feedback. ACM Int. Conf. on Multimedia, p.953–956.
Zhang, H., Shang, X., Luan, H., et al., 2016. Learning from collective intelligence: feature learning using social images and tags. ACM Trans. Multim. Comput. Commun. Appl., 13(1):1.
Zhang, J., Wang, S., Huang, Q., 2015. Location-based parallel tag completion for geo-tagged social image retrieval. ACM Int. Conf. on Multimedia Retrieval, p.355–362.
Zhu, Y., Zhang, C., Ré, C., et al., 2015. Building a large-scale multimodal knowledge base system for answering visual queries. arXiv:1507.05670.