Information geometry in optimization, machine learning and statistical inference

Shun-ichi AMARI
RIKEN Brain Science Institute, Saitama 351-0198, Japan
Abstract The present article gives an introduction to information geometry and surveys its applications in machine learning, optimization and statistical inference. Information geometry is explained intuitively through divergence functions, introduced on a manifold of probability distributions and on other general manifolds. A divergence induces a Riemannian structure together with a pair of dually coupled affine connections, and many important manifolds turn out to be dually flat. When a manifold is dually flat, a generalized Pythagorean theorem and a related projection theorem hold, providing useful tools for various approximation and optimization problems. We apply them to alternating minimization problems, Ying-Yang machines and the belief propagation algorithm in machine learning.
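As an illustration of the generalized Pythagorean theorem mentioned in the abstract, consider the Kullback-Leibler divergence on discrete distributions: the m-projection of a joint distribution onto the dually flat family of product distributions is the product of its marginals, and the divergence to any other product distribution then decomposes exactly. The following minimal numerical sketch (the particular distributions are arbitrary choices for illustration, not taken from the article) checks D(p‖r) = D(p‖q*) + D(q*‖r):

```python
import numpy as np

def kl(p, q):
    """Kullback-Leibler divergence D(p || q) for discrete distributions."""
    p, q = np.asarray(p, float).ravel(), np.asarray(q, float).ravel()
    return float(np.sum(p * np.log(p / q)))

# Joint distribution p(x, y) over a 2x2 alphabet (not independent).
p = np.array([[0.3, 0.1],
              [0.2, 0.4]])

# m-projection of p onto the (m-flat) submanifold of product
# distributions: the product of its marginals.
q_star = np.outer(p.sum(axis=1), p.sum(axis=0))

# Any other point r of that submanifold.
r = np.outer([0.6, 0.4], [0.45, 0.55])

# Generalized Pythagorean theorem: D(p||r) = D(p||q*) + D(q*||r).
lhs = kl(p, r)
rhs = kl(p, q_star) + kl(q_star, r)
assert abs(lhs - rhs) < 1e-12
```

The decomposition is exact here because the product family is m-flat and the projection minimizes D(p‖·) over it, which is the setting in which the projection theorem applies.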

Keywords: information geometry; machine learning; optimization; statistical inference; divergence; graphical model; Ying-Yang machine

Issue Date: 05 September 2010

Amari S, Nagaoka H. Methods of Information Geometry. New York: Oxford University Press, 2000

Csiszár I. Information-type measures of difference of probability distributions and indirect observations. Studia Scientiarum Mathematicarum Hungarica, 1967, 2: 299―318

Bregman L. The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Computational Mathematics and Mathematical Physics, 1967, 7(3): 200―217. doi: 10.1016/0041-5553(67)90040-7

Eguchi S. Second order efficiency of minimum contrast estimators in a curved exponential family. The Annals of Statistics, 1983, 11(3): 793―803. doi: 10.1214/aos/1176346246

Chentsov N N. Statistical Decision Rules and Optimal Inference. Rhode Island, USA: American Mathematical Society, 1982 (originally published in Russian, Moscow: Nauka, 1972)

Dempster A P, Laird N M, Rubin D B. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 1977, 39(1): 1―38

Csiszár I, Tusnády G. Information geometry and alternating minimization procedures. Statistics and Decisions, 1984, Supplement Issue 1: 205―237

Amari S. Information geometry of the EM and em algorithms for neural networks. Neural Networks, 1995, 8(9): 1379―1408. doi: 10.1016/0893-6080(95)00003-8

Xu L. Bayesian Ying-Yang machine, clustering and number of clusters. Pattern Recognition Letters, 1997, 18(11―13): 1167―1178. doi: 10.1016/S0167-8655(97)00121-9

Xu L. RBF nets, mixture experts, and Bayesian Ying-Yang learning. Neurocomputing, 1998, 19(1―3): 223―257. doi: 10.1016/S0925-2312(97)00091-X

Xu L. Bayesian Kullback Ying-Yang dependence reduction theory. Neurocomputing, 1998, 22(1―3): 81―111. doi: 10.1016/S0925-2312(98)00051-4

Xu L. BYY harmony learning, independent state space, and generalized APT financial analyses. IEEE Transactions on Neural Networks, 2001, 12(4): 822―849. doi: 10.1109/72.935094

Xu L. Best harmony, unified RPCL and automated model selection for unsupervised and supervised learning on Gaussian mixtures, three-layer nets and ME-RBF-SVM models. International Journal of Neural Systems, 2001, 11(1): 43―69. doi: 10.1016/S0129-0657(01)00049-7

Xu L. BYY harmony learning, structural RPCL, and topological self-organizing on mixture models. Neural Networks, 2002, 15(8―9): 1125―1151. doi: 10.1016/S0893-6080(02)00084-9

Pearl J. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. San Mateo, CA: Morgan Kaufmann, 1988

Ikeda S, Tanaka T, Amari S. Information geometry of turbo and low-density parity-check codes. IEEE Transactions on Information Theory, 2004, 50(6): 1097―1114. doi: 10.1109/TIT.2004.828072

Ikeda S, Tanaka T, Amari S. Stochastic reasoning, free energy, and information geometry. Neural Computation, 2004, 16(9): 1779―1810. doi: 10.1162/0899766041336477

Csiszár I. Information measures: A critical survey. In: Transactions of the 7th Prague Conference. 1974, 83―86

Csiszár I. Axiomatic characterizations of information measures. Entropy, 2008, 10(3): 261―273. doi: 10.3390/e10030261

Ali S M, Silvey S D. A general class of coefficients of divergence of one distribution from another. Journal of the Royal Statistical Society, Series B, 1966, 28(1): 131―142

Amari S. α-divergence is unique, belonging to both f-divergence and Bregman divergence classes. IEEE Transactions on Information Theory, 2009, 55(11): 4925―4931. doi: 10.1109/TIT.2009.2030485

Cichocki A, Zdunek R, Phan A H, Amari S. Nonnegative Matrix and Tensor Factorizations. John Wiley, 2009. doi: 10.1002/9780470747278

Havrda J, Charvát F. Quantification method of classification process: Concept of structural α-entropy. Kybernetika, 1967, 3: 30―35

Chernoff H. A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations. Annals of Mathematical Statistics, 1952, 23(4): 493―507. doi: 10.1214/aoms/1177729330

Matsuyama Y. The α-EM algorithm: Surrogate likelihood maximization using α-logarithmic information measures. IEEE Transactions on Information Theory, 2002, 49(3): 672―706

Amari S. Integration of stochastic models by minimizing α-divergence. Neural Computation, 2007, 19(10): 2780―2796. doi: 10.1162/neco.2007.19.10.2780

Amari S. Information geometry and its applications: Convex function and dually flat manifold. In: Nielsen F, ed. Emerging Trends in Visual Computing. Lecture Notes in Computer Science, Vol 5416. Berlin: Springer-Verlag, 2009, 75―102. doi: 10.1007/978-3-642-00826-9_4

Eguchi S, Copas J. A class of logistic-type discriminant functions. Biometrika, 2002, 89(1): 1―22. doi: 10.1093/biomet/89.1.1

Murata N, Takenouchi T, Kanamori T, Eguchi S. Information geometry of U-boost and Bregman divergence. Neural Computation, 2004, 16(7): 1437―1481. doi: 10.1162/089976604323057452

Minami M, Eguchi S. Robust blind source separation by beta-divergence. Neural Computation, 2002, 14(8): 1859―1886. doi: 10.1162/089976602760128045

Byrne W. Alternating minimization and Boltzmann machine learning. IEEE Transactions on Neural Networks, 1992, 3(4): 612―620. doi: 10.1109/72.143375

Amari S, Kurata K, Nagaoka H. Information geometry of Boltzmann machines. IEEE Transactions on Neural Networks, 1992, 3(2): 260―271. doi: 10.1109/72.125867

Amari S. Natural gradient works efficiently in learning. Neural Computation, 1998, 10(2): 251―276. doi: 10.1162/089976698300017746

Amari S, Takeuchi A. Mathematical theory on formation of category detecting nerve cells. Biological Cybernetics, 1978, 29(3): 127―136. doi: 10.1007/BF00337348

Jordan M I. Learning in Graphical Models. Cambridge, MA: MIT Press, 1999

Yuille A L. CCCP algorithms to minimize the Bethe and Kikuchi free energies: Convergent alternatives to belief propagation. Neural Computation, 2002, 14(7): 1691―1722. doi: 10.1162/08997660260028674

Yuille A L, Rangarajan A. The concave-convex procedure. Neural Computation, 2003, 15(4): 915―936. doi: 10.1162/08997660360581958

Opper M, Saad D. Advanced Mean Field Methods ― Theory and Practice. Cambridge, MA: MIT Press, 2001

Tanaka T. Information geometry of mean-field approximation. Neural Computation, 2000, 12(8): 1951―1968. doi: 10.1162/089976600300015213

Amari S, Ikeda S, Shimokawa H. Information geometry and mean field approximation: The α-projection approach. In: Opper M, Saad D, eds. Advanced Mean Field Methods ― Theory and Practice. Cambridge, MA: MIT Press, 2001, 241―257