Information geometry in optimization, machine learning and statistical inference

Shun-ichi AMARI
RIKEN Brain Science Institute, Saitama 351-0198, Japan
Abstract The present article gives an introduction to information geometry and surveys its applications in machine learning, optimization and statistical inference. Information geometry is explained intuitively through divergence functions, introduced on a manifold of probability distributions and on other general manifolds. A divergence induces a Riemannian structure together with a pair of dually coupled affine connections, and many important manifolds turn out to be dually flat. When a manifold is dually flat, a generalized Pythagorean theorem and a related projection theorem hold, providing useful tools for various approximation and optimization problems. We apply them to alternating minimization problems, Ying-Yang machines and the belief propagation algorithm in machine learning.
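As an illustration of the generalized Pythagorean theorem mentioned in the abstract, consider the Kullback-Leibler divergence on discrete distributions: the m-projection of a joint distribution onto the dually flat family of product distributions is the product of its marginals, and the divergence to any other product distribution then decomposes exactly. The following minimal numerical sketch (the particular distributions are arbitrary choices for illustration, not taken from the article) checks D(p‖r) = D(p‖q*) + D(q*‖r):

```python
import numpy as np

def kl(p, q):
    """Kullback-Leibler divergence D(p || q) for discrete distributions."""
    p, q = np.asarray(p, float).ravel(), np.asarray(q, float).ravel()
    return float(np.sum(p * np.log(p / q)))

# Joint distribution p(x, y) over a 2x2 alphabet (not independent).
p = np.array([[0.3, 0.1],
              [0.2, 0.4]])

# m-projection of p onto the (m-flat) submanifold of product
# distributions: the product of its marginals.
q_star = np.outer(p.sum(axis=1), p.sum(axis=0))

# Any other point r of that submanifold.
r = np.outer([0.6, 0.4], [0.45, 0.55])

# Generalized Pythagorean theorem: D(p||r) = D(p||q*) + D(q*||r).
lhs = kl(p, r)
rhs = kl(p, q_star) + kl(q_star, r)
assert abs(lhs - rhs) < 1e-12
```

The decomposition is exact here because the product family is m-flat and the projection minimizes D(p‖·) over it, which is the setting in which the projection theorem applies.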

Keywords: information geometry; machine learning; optimization; statistical inference; divergence; graphical model; Ying-Yang machine

Issue Date: 05 September 2010

Amari S, Nagaoka H. Methods of Information Geometry. New York: Oxford University Press, 2000

Csiszár I. Information-type measures of difference of probability distributions and indirect observations. Studia Scientiarum Mathematicarum Hungarica, 1967, 2: 299―318

Bregman L. The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Computational Mathematics and Mathematical Physics, 1967, 7(3): 200―217. doi: 10.1016/0041-5553(67)90040-7

Eguchi S. Second order efficiency of minimum contrast estimators in a curved exponential family. The Annals of Statistics, 1983, 11(3): 793―803. doi: 10.1214/aos/1176346246

Chentsov N N. Statistical Decision Rules and Optimal Inference. Rhode Island, USA: American Mathematical Society, 1982 (originally published in Russian, Moscow: Nauka, 1972)

Dempster A P, Laird N M, Rubin D B. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 1977, 39(1): 1―38

Csiszár I, Tusnády G. Information geometry and alternating minimization procedures. Statistics and Decisions, 1984, Supplement Issue 1: 205―237

Amari S. Information geometry of the EM and em algorithms for neural networks. Neural Networks, 1995, 8(9): 1379―1408. doi: 10.1016/0893-6080(95)00003-8

Xu L. Bayesian Ying-Yang machine, clustering and number of clusters. Pattern Recognition Letters, 1997, 18(11―13): 1167―1178. doi: 10.1016/S0167-8655(97)00121-9

Xu L. RBF nets, mixture experts, and Bayesian Ying-Yang learning. Neurocomputing, 1998, 19(1―3): 223―257. doi: 10.1016/S0925-2312(97)00091-X

Xu L. Bayesian Kullback Ying-Yang dependence reduction theory. Neurocomputing, 1998, 22(1―3): 81―111. doi: 10.1016/S0925-2312(98)00051-4

Xu L. BYY harmony learning, independent state space, and generalized APT financial analyses. IEEE Transactions on Neural Networks, 2001, 12(4): 822―849. doi: 10.1109/72.935094

Xu L. Best harmony, unified RPCL and automated model selection for unsupervised and supervised learning on Gaussian mixtures, three-layer nets and ME-RBF-SVM models. International Journal of Neural Systems, 2001, 11(1): 43―69. doi: 10.1016/S0129-0657(01)00049-7

Xu L. BYY harmony learning, structural RPCL, and topological self-organizing on mixture models. Neural Networks, 2002, 15(8―9): 1125―1151. doi: 10.1016/S0893-6080(02)00084-9

Pearl J. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. San Mateo, CA: Morgan Kaufmann, 1988

Ikeda S, Tanaka T, Amari S. Information geometry of turbo and low-density parity-check codes. IEEE Transactions on Information Theory, 2004, 50(6): 1097―1114. doi: 10.1109/TIT.2004.828072

Ikeda S, Tanaka T, Amari S. Stochastic reasoning, free energy, and information geometry. Neural Computation, 2004, 16(9): 1779―1810. doi: 10.1162/0899766041336477

Csiszár I. Information measures: A critical survey. In: Transactions of the 7th Prague Conference. 1974, 83―86

Csiszár I. Axiomatic characterizations of information measures. Entropy, 2008, 10(3): 261―273. doi: 10.3390/e10030261

Ali S M, Silvey S D. A general class of coefficients of divergence of one distribution from another. Journal of the Royal Statistical Society, Series B, 1966, 28(1): 131―142

Amari S. α-divergence is unique, belonging to both f-divergence and Bregman divergence classes. IEEE Transactions on Information Theory, 2009, 55(11): 4925―4931. doi: 10.1109/TIT.2009.2030485

Cichocki A, Zdunek R, Phan A H, Amari S. Nonnegative Matrix and Tensor Factorizations. John Wiley, 2009. doi: 10.1002/9780470747278

Havrda J, Charvát F. Quantification method of classification process: Concept of structural α-entropy. Kybernetika, 1967, 3: 30―35

Chernoff H. A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations. Annals of Mathematical Statistics, 1952, 23(4): 493―507. doi: 10.1214/aoms/1177729330

Matsuyama Y. The α-EM algorithm: Surrogate likelihood maximization using α-logarithmic information measures. IEEE Transactions on Information Theory, 2002, 49(3): 672―706

Amari S. Integration of stochastic models by minimizing α-divergence. Neural Computation, 2007, 19(10): 2780―2796. doi: 10.1162/neco.2007.19.10.2780

Amari S. Information geometry and its applications: Convex function and dually flat manifold. In: Nielsen F, ed. Emerging Trends in Visual Computing. Lecture Notes in Computer Science, Vol 5416. Berlin: Springer-Verlag, 2009, 75―102. doi: 10.1007/978-3-642-00826-9_4

Eguchi S, Copas J. A class of logistic-type discriminant functions. Biometrika, 2002, 89(1): 1―22. doi: 10.1093/biomet/89.1.1

Murata N, Takenouchi T, Kanamori T, Eguchi S. Information geometry of U-boost and Bregman divergence. Neural Computation, 2004, 16(7): 1437―1481. doi: 10.1162/089976604323057452

Minami M, Eguchi S. Robust blind source separation by beta-divergence. Neural Computation, 2002, 14(8): 1859―1886. doi: 10.1162/089976602760128045

Byrne W. Alternating minimization and Boltzmann machine learning. IEEE Transactions on Neural Networks, 1992, 3(4): 612―620. doi: 10.1109/72.143375

Amari S, Kurata K, Nagaoka H. Information geometry of Boltzmann machines. IEEE Transactions on Neural Networks, 1992, 3(2): 260―271. doi: 10.1109/72.125867

Amari S. Natural gradient works efficiently in learning. Neural Computation, 1998, 10(2): 251―276. doi: 10.1162/089976698300017746

Amari S, Takeuchi A. Mathematical theory on formation of category detecting nerve cells. Biological Cybernetics, 1978, 29(3): 127―136. doi: 10.1007/BF00337348

Jordan M I. Learning in Graphical Models. Cambridge, MA: MIT Press, 1999

Yuille A L. CCCP algorithms to minimize the Bethe and Kikuchi free energies: Convergent alternatives to belief propagation. Neural Computation, 2002, 14(7): 1691―1722. doi: 10.1162/08997660260028674

Yuille A L, Rangarajan A. The concave-convex procedure. Neural Computation, 2003, 15(4): 915―936. doi: 10.1162/08997660360581958

Opper M, Saad D. Advanced Mean Field Methods ― Theory and Practice. Cambridge, MA: MIT Press, 2001

Tanaka T. Information geometry of mean-field approximation. Neural Computation, 2000, 12(8): 1951―1968. doi: 10.1162/089976600300015213

Amari S, Ikeda S, Shimokawa H. Information geometry and mean field approximation: The α-projection approach. In: Opper M, Saad D, eds. Advanced Mean Field Methods ― Theory and Practice. Cambridge, MA: MIT Press, 2001, 241―257