1. Institute of Computer Science and Technology, Peking University, Beijing 100871, China 2. Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China 3. Institute of Information Science, Beijing Jiaotong University, Beijing 100044, China 4. National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China 5. Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China 6. Department of Computer Science and Technology, Xi’an Jiaotong University, Xi’an 710049, China 7. School of Electronics Engineering and Computer Science, Peking University, Beijing 100871, China
Cross-media analysis and reasoning is an active research area in computer science, and a promising direction for artificial intelligence. However, to the best of our knowledge, no existing work has summarized the state-of-the-art methods for cross-media analysis and reasoning or presented advances, challenges, and future directions for the field. To address these issues, we provide an overview as follows: (1) theory and model for cross-media uniform representation; (2) cross-media correlation understanding and deep mining; (3) cross-media knowledge graph construction and learning methodologies; (4) cross-media knowledge evolution and reasoning; (5) cross-media description and generation; (6) cross-media intelligent engines; and (7) cross-media intelligent applications. By presenting approaches, advances, and future directions in cross-media analysis and reasoning, our goal is not only to draw more attention to the state-of-the-art advances in the field, but also to provide technical insights by discussing the challenges and research directions in these areas.
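The classical starting point for topic (1), cross-media uniform representation, is canonical correlation analysis (Hotelling, 1936; see also Andrew et al., 2013 and Rasiwasia et al., 2014 in the references below), which projects features from two modalities into a shared space where their correlation is maximized. The following NumPy sketch is illustrative only: the function name, regularization constant, and feature dimensions are our own choices, not part of any cited work.

```python
import numpy as np

def cca(X, Y, n_components=2, reg=1e-6):
    """Classical canonical correlation analysis (Hotelling, 1936).

    X, Y: (n_samples, d1) and (n_samples, d2) feature matrices for two
    modalities (e.g. image and text features of the same documents).
    Returns projection matrices Wx, Wy and the canonical correlations.
    """
    # Center each view.
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    # Regularized covariance and cross-covariance matrices.
    Sxx = X.T @ X / n + reg * np.eye(X.shape[1])
    Syy = Y.T @ Y / n + reg * np.eye(Y.shape[1])
    Sxy = X.T @ Y / n

    def inv_sqrt(S):
        # Inverse matrix square root via eigendecomposition
        # (S is symmetric positive definite thanks to `reg`).
        vals, vecs = np.linalg.eigh(S)
        return vecs @ np.diag(vals ** -0.5) @ vecs.T

    Kx, Ky = inv_sqrt(Sxx), inv_sqrt(Syy)
    # Singular values of the whitened cross-covariance are the
    # canonical correlations; singular vectors give the directions.
    U, corrs, Vt = np.linalg.svd(Kx @ Sxy @ Ky)
    Wx = Kx @ U[:, :n_components]
    Wy = Ky @ Vt.T[:, :n_components]
    return Wx, Wy, corrs[:n_components]
```

After projection, heterogeneous modalities live in one space, so cross-media retrieval reduces to nearest-neighbor search there; deep variants (e.g. deep CCA) replace the linear maps with neural networks.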
Aamodt, A., Plaza, E., 1994. Case-based reasoning: foundational issues, methodological variations, and system approaches. AI Commun., 7(1):39–59.
Adib, F., Hsu, C.Y., Mao, H., et al., 2015. Capturing the human figure through a wall. ACM Trans. Graph., 34(6):219.
Andrew, G., Arora, R., Bilmes, J., et al., 2013. Deep canonical correlation analysis. Int. Conf. on Machine Learning, p.1247–1255.
Antenucci, D., Li, E., Liu, S., et al., 2013. Ringtail: a generalized nowcasting system. Proc. VLDB Endow., 6(12):1358–1361.
Antol, S., Agrawal, A., Lu, J., et al., 2015. VQA: visual question answering. IEEE Int. Conf. on Computer Vision, p.2425–2433.
Babenko, A., Slesarev, A., Chigorin, A., et al., 2014. Neural codes for image retrieval. European Conf. on Computer Vision, p.584–599.
Brownson, R.C., Gurney, J.G., Land, G.H., 1999. Evidence-based decision making in public health. J. Publ. Health Manag. Pract., 5(5):86–97.
Carlson, A., Betteridge, J., Kisiel, B., et al., 2010. Toward an architecture for never-ending language learning. AAAI Conf. on Artificial Intelligence, p.1306–1313.
Chen, D.P., Weber, S.C., Constantinou, P.S., et al., 2007. Clinical arrays of laboratory measures, or “clinarrays”, built from an electronic health record enable disease subtyping by severity. AMIA Annual Symp. Proc., p.115–119.
Chen, X., Shrivastava, A., Gupta, A., 2013. NEIL: extracting visual knowledge from web data. IEEE Int. Conf. on Computer Vision, p.1409–1416.
Chen, Y., Carroll, R.J., Hinz, E.R.M., et al., 2013. Applying active learning to high-throughput phenotyping algorithms for electronic health records data. J. Am. Med. Inform. Assoc., 20(e2):253–259.
Cilibrasi, R.L., Vitanyi, P.M.B., 2007. The Google similarity distance. IEEE Trans. Knowl. Data Eng., 19(3):370–383.
Culotta, A., 2014. Estimating county health statistics with Twitter. ACM Conf. on Human Factors in Computing Systems, p.1335–1344.
Daras, P., Manolopoulou, S., Axenopoulos, A., 2012. Search and retrieval of rich media objects supporting multiple multimodal queries. IEEE Trans. Multim., 14(3):734–746.
Davenport, T.H., Prusak, L., 1998. Working Knowledge: How Organizations Manage What They Know. Harvard Business School Press, Boston, p.5.
Deng, J., Dong, W., Socher, R., et al., 2009. ImageNet: a large-scale hierarchical image database. IEEE Conf. on Computer Vision and Pattern Recognition, p.248–255.
Dong, X., Gabrilovich, E., Heitz, G., et al., 2014. Knowledge vault: a Web-scale approach to probabilistic knowledge fusion. ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, p.601–610.
Fang, Q., Xu, C., Sang, J., et al., 2016. Folksonomy-based visual ontology construction and its applications. IEEE Trans. Multim., 18(4):702–713.
Fellbaum, C., Miller, G., 1998. WordNet: an Electronic Lexical Database. MIT Press, Cambridge, MA.
Feng, F., Wang, X., Li, R., 2014. Cross-modal retrieval with correspondence autoencoder. ACM Int. Conf. on Multimedia, p.7–16.
Ferrucci, D., Levas, A., Bagchi, S., et al., 2013. Watson: beyond Jeopardy! Artif. Intell., 199-200:93–105.
Garfield, E., 2004. Historiographic mapping of knowledge domains literature. J. Inform. Sci., 30(2):119–145.
Gibney, E., 2015. DeepMind algorithm beats people at classic video games. Nature, 518(7540):465–466.
Ginsberg, J., Mohebbi, M., Patel, R.S., et al., 2009. Detecting influenza epidemics using search engine query data. Nature, 457(7232):1012–1014.
Gong, Y., Ke, Q., Isard, M., et al., 2014. A multi-view embedding space for modeling internet images, tags, and their semantics. Int. J. Comput. Vis., 106(2):210–233.
Hodosh, M., Young, P., Hockenmaier, J., 2013. Framing image description as a ranking task: data, models and evaluation metrics. J. Artif. Intell. Res., 47(1):853–899.
Hotelling, H., 1936. Relations between two sets of variates. Biometrika, 28(3-4):321–377.
Hsu, F., 2002. Behind Deep Blue: Building the Computer that Defeated the World Chess Champion. Princeton University Press, Princeton, USA.
Hua, Y., Wang, S., Liu, S., et al., 2014. TINA: cross-modal correlation learning by adaptive hierarchical semantic aggregation. IEEE Int. Conf. on Data Mining, p.190–199.
Jia, X., Gavves, E., Fernando, B., et al., 2015. Guiding long-short term memory for image caption generation. arXiv:1509.04942.
Johnson, J., Krishna, R., Stark, M., et al., 2015. Image retrieval using scene graphs. IEEE Conf. on Computer Vision and Pattern Recognition, p.3668–3678.
Karpathy, A., Li, F.F., 2015. Deep visual-semantic alignments for generating image descriptions. IEEE Conf. on Computer Vision and Pattern Recognition, p.3128–3137.
Krizhevsky, A., Sutskever, I., Hinton, G.E., 2012. ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, p.1097–1105.
Kulkarni, G., Premraj, V., Dhar, S., et al., 2011. Baby talk: understanding and generating simple image descriptions. IEEE Conf. on Computer Vision and Pattern Recognition, p.1601–1608.
Kumar, S., Sanderford, M., Gray, V.E., et al., 2012. Evolutionary diagnosis method for variants in personal exomes. Nat. Meth., 9(9):855–856.
Kuznetsova, P., Ordonez, V., Berg, T.L., et al., 2014. TREETALK: composition and compression of trees for image descriptions. Trans. Assoc. Comput. Ling., 2:351–362.
Lazaric, A., 2012. Transfer in reinforcement learning: a framework and a survey. In: Wiering, M., van Otterlo, M. (Eds.), Reinforcement Learning: State-of-the-Art. Springer Berlin Heidelberg, Berlin, p.143–173.
Lazer, D., Kennedy, R., King, G., et al., 2014. The parable of Google flu: traps in big data analysis. Science, 343(6176):1203–1205.
Lew, M.S., Sebe, N., Djeraba, C., et al., 2006. Content-based multimedia information retrieval: state of the art and challenges. ACM Trans. Multim. Comput. Commun. Appl., 2(1):1–19.
Lin, T., Pantel, P., Gamon, M., et al., 2012. Active objects: actions for entity-centric search. ACM Int. Conf. on World Wide Web, p.589–598.
Luo, G., Tang, C., 2008. On iterative intelligent medical search. ACM SIGIR Conf. on Research and Development in Information Retrieval, p.3–10.
Mao, X., Lin, B., Cai, D., et al., 2013. Parallel field alignment for cross media retrieval. ACM Int. Conf. on Multimedia, p.897–906.
Prabhu, N., Babu, R.V., 2015. Attribute-Graph: a graph based approach to image ranking. IEEE Int. Conf. on Computer Vision, p.1071–1079.
Radinsky, K., Davidovich, S., Markovitch, S., 2012. Learning causality for news events prediction. Int. Conf. on World Wide Web, p.909–918.
Rasiwasia, N., Costa Pereira, J., Coviello, E., et al., 2010. A new approach to cross-modal multimedia retrieval. ACM Int. Conf. on Multimedia, p.251–260.
Rasiwasia, N., Mahajan, D., Mahadevan, V., et al., 2014. Cluster canonical correlation analysis. Int. Conf. on Artificial Intelligence and Statistics, p.823–831.
Rautaray, S.S., Agrawal, A., 2015. Vision based hand gesture recognition for human computer interaction: a survey. Artif. Intell. Rev., 43(1):1–54.
Roller, S., Schulte im Walde, S., 2013. A multimodal LDA model integrating textual, cognitive and visual modalities. Conf. on Empirical Methods in Natural Language Processing, p.1146–1157.
Sadeghi, F., Divvala, S.K., Farhadi, A., 2015. VisKE: visual knowledge extraction and question answering by visual verification of relation phrases. IEEE Conf. on Computer Vision and Pattern Recognition, p.1456–1464.
Singhal, A., 2012. Introducing the knowledge graph: things, not strings. Official Blog of Google.
Socher, R., Lin, C., Ng, A.Y., et al., 2011. Parsing natural scenes and natural language with recursive neural networks. Int. Conf. on Machine Learning, p.129–136.
Socher, R., Karpathy, A., Le, Q., et al., 2014. Grounded compositional semantics for finding and describing images with sentences. Trans. Assoc. Comput. Ling., 2:207–218.
Srivastava, N., Salakhutdinov, R., 2012. Multimodal learning with deep Boltzmann machines. Advances in Neural Information Processing Systems, p.2222–2230.
Suchanek, F., Weikum, G., 2014. Knowledge bases in the age of big data analytics. Proc. VLDB Endow., 7(13):1713–1714.
Uyar, A., Aliyu, F.M., 2015. Evaluating search features of Google Knowledge Graph and Bing Satori: entity types, list searches and query interfaces. Onl. Inform. Rev., 39(2):197–213.
Vinyals, O., Toshev, A., Bengio, S., et al., 2015. Show and tell: a neural image caption generator. IEEE Conf. on Computer Vision and Pattern Recognition, p.3156–3164.
Wang, D., Cui, P., Ou, M., et al., 2015. Learning compact hash codes for multimodal representations using orthogonal deep structure. IEEE Trans. Multim., 17(9):1404–1416.
Wang, W., Ooi, B.C., Yang, X., et al., 2014. Effective multi-modal retrieval based on stacked auto-encoders. Proc. VLDB Endow., 7(8):649–660.
Wang, Y., Wu, F., Song, J., et al., 2014. Multi-modal mutual topic reinforce modeling for cross-media retrieval. ACM Int. Conf. on Multimedia, p.307–316.
Wei, Y., Zhao, Y., Lu, C., et al., 2017. Cross-modal retrieval with CNN visual features: a new baseline. IEEE Trans. Cybern., 47(2):449–460.
Wu, W., Xu, J., Li, H., 2010. Learning similarity function between objects in heterogeneous spaces. Technical Report MSR-TR-2010-86, Microsoft.
Xu, K., Ba, J., Kiros, R., et al., 2015. Show, attend and tell: neural image caption generation with visual attention. Int. Conf. on Machine Learning, p.2048–2057.
Yang, Y., Zhuang, Y., Wu, F., et al., 2008. Harmonizing hierarchical manifolds for multimedia document semantics understanding and cross-media retrieval. IEEE Trans. Multim., 10(3):437–446.
Yang, Y., Teo, C.L., Daume, H., et al., 2011. Corpus-guided sentence generation of natural images. Conf. on Empirical Methods in Natural Language Processing, p.444–454.
Yang, Y., Nie, F., Xu, D., et al., 2012. A multimedia retrieval framework based on semi-supervised ranking and relevance feedback. IEEE Trans. Patt. Anal. Mach. Intell., 34(4):723–742.
Yuan, L., Pan, C., Ji, S., et al., 2014. Automated annotation of developmental stages of Drosophila embryos in images containing spatial patterns of expression. Bioinformatics, 30(2):266–273.
Zhai, X., Peng, Y., Xiao, J., 2014. Learning cross-media joint representation with sparse and semi-supervised regularization. IEEE Trans. Circ. Syst. Video Technol., 24(6):965–978.
Zhang, H., Yang, Y., Luan, H., et al., 2014a. Start from scratch: towards automatically identifying, modeling, and naming visual attributes. ACM Int. Conf. on Multimedia, p.187–196.
Zhang, H., Yuan, J., Gao, X., et al., 2014b. Boosting cross-media retrieval via visual-auditory feature analysis and relevance feedback. ACM Int. Conf. on Multimedia, p.953–956.
Zhang, H., Shang, X., Luan, H., et al., 2016. Learning from collective intelligence: feature learning using social images and tags. ACM Trans. Multim. Comput. Commun. Appl., 13(1):1.
Zhang, J., Wang, S., Huang, Q., 2015. Location-based parallel tag completion for geo-tagged social image retrieval. ACM Int. Conf. on Multimedia Retrieval, p.355–362.
Zhu, Y., Zhang, C., Ré, C., et al., 2015. Building a large-scale multimodal knowledge base system for answering visual queries. arXiv:1507.05670.