Further Reading: Chapter 4


The analysis of generalisation based on the VC dimension was developed by Vapnik and Chervonenkis starting in the mid-1960s. It has been successfully applied in different fields, motivating, for example, the development in statistics of uniform laws of large numbers. Most of its basic theoretical results were already present in Vapnik’s book. The development of the pac model of machine learning, on the other hand, can be traced to the seminal 1984 paper by Valiant, which laid the foundations of what became known as Computational Learning Theory: a large family of models covering, among others, on-line learning, query learning, and unsupervised and supervised learning, and recently also applied to reinforcement learning. The introduction of the VC results into this theory, together with the lower bounds, is contained in the landmark paper by Blumer et al., which has greatly influenced the entire field of machine learning. VC theory has since been used to analyse the performance of learning systems as diverse as decision trees and neural networks, and many learning heuristics and principles used in practical applications of machine learning have been explained in terms of VC theory.

Many introductory books on Computational Learning Theory exist, for example those of Anthony and Biggs and of Kearns and Vazirani, although they are often mainly focused on the pac model, somewhat neglecting the rich fields of on-line, query, and unsupervised learning. A good introduction to the statistical analysis of pattern recognition is given by Devroye et al. VC theory has recently also come to be known as Statistical Learning Theory, and is extensively described in the recent book of Vapnik, in other books that preceded it, and in earlier papers by Vapnik and Chervonenkis. An easy introduction to the theory is provided by Vidyasagar, and a complete account, including very recent results, can be found in the book of Anthony and Bartlett. An international conference on computational learning theory, known as the COLT conference, is held every year, where new results are presented. Websites such as www.neurocolt.com offer large repositories of recent papers.

The question of how the margin might affect generalisation was raised by many researchers, including Duda and Hart, Vapnik and co-workers, and Mangasarian. Vapnik and Chervonenkis obtained bounds for the case when the margin is measured on the combined training and test sets, using a quantity analogous to the fat-shattering dimension. The fat-shattering dimension itself (sometimes also called the scale-sensitive dimension, or $V_\gamma$ dimension) had appeared implicitly in a number of earlier references, but was introduced into Computational Learning Theory by Kearns and Schapire and shown by Alon et al. to characterise the so-called Glivenko-Cantelli classes. The fat-shattering dimension of linear classifiers was obtained by different authors (see Gurvits 97 and Shawe-Taylor et al.), while the proof presented in this chapter appears in the paper by Bartlett and Shawe-Taylor.
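For reference, the fat-shattering dimension can be recalled as follows (this is the standard definition given by Kearns and Schapire and used in the texts cited above; the notation here is ours): a set of points $\{x_1, \dots, x_k\}$ is $\gamma$-shattered by a class $\mathcal{F}$ of real-valued functions if

\[
\exists\, r_1, \dots, r_k \in \mathbb{R} \ \text{such that}\ \forall\, b \in \{-1,+1\}^k \ \exists\, f_b \in \mathcal{F}: \quad b_i \bigl( f_b(x_i) - r_i \bigr) \ge \gamma, \qquad i = 1, \dots, k,
\]

and $\mathrm{fat}_{\mathcal{F}}(\gamma)$ is the cardinality of the largest $\gamma$-shattered set.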

The first papers to prove the large margin results were those of Shawe-Taylor et al. and Bartlett, the second reference containing the percentile result of Subsection 4.3.2. The soft margin bounds involving the margin slack vector, for both classification and regression, are due to Shawe-Taylor and Cristianini (see the proceedings of EuroCOLT 99 and COLT 99, and the NeuroCOLT repository for more details, as well as the survey by Shawe-Taylor and Cristianini), and use techniques similar to those discussed in Chapter 2 for obtaining mistake bounds in the non-separable case (see that chapter for further references). Those results are summarised in two accessible surveys, one by Bartlett and Shawe-Taylor and one by Shawe-Taylor and Cristianini. A quantity related to the margin slack vector is the so-called hinge loss, used by Gentile and Warmuth to obtain mistake bounds in the on-line learning framework. Margin analysis has since been applied to describe systems such as AdaBoost, Bayesian classifiers, decision trees, and neural networks, and is now a standard tool in machine learning. The margin analysis has also been extended to take account of the margin of the test example.
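As a reminder of the quantities just mentioned (stated here with the conventions of Chapter 2; the normalisation of the hinge loss to a target margin of 1 is the usual one, though conventions vary between authors), the margin slack variable of a labelled example $(x_i, y_i)$ with respect to a real-valued hypothesis $f$ and target margin $\gamma$ is

\[
\xi_i = \max\bigl( 0, \; \gamma - y_i f(x_i) \bigr),
\]

the margin slack vector being $\xi = (\xi_1, \dots, \xi_\ell)$; the hinge loss corresponds to the choice $\gamma = 1$, that is $\max(0, 1 - y f(x))$.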

Anthony and Bartlett used the fat-shattering dimension to obtain results for regression similar to Theorem 4.26. Different analyses of generalisation are possible for regression, such as that in Vapnik 98. The book of Anthony and Bartlett provides an excellent introduction to the analysis of regression.

The reason why margin analysis requires tools different from those of VC theory is that the quantity used to characterise the richness of a hypothesis class, the margin, depends on the data: only after training the learning machine can one know the complexity of the resulting hypothesis. This style of analysis, which provides a way of exploiting benign collusions between the target function and the input distribution, is often called data-dependent analysis, or data-dependent structural risk minimisation. The first data-dependent result was Theorem 4.25 on the generalisation power of compression schemes, which is due to Littlestone, Floyd, and Warmuth, while the paper by Shawe-Taylor et al. introduced the general luckiness framework mentioned in Section 4.4. Other data-dependent results include micro-choice algorithms and PAC-Bayesian bounds. More recent results include bounds based on the eigenvalues of the Gram matrix and those in Evgeniou et al.

Bayesian analysis is a traditional field within statistics, and has long been applied to pattern recognition. In recent years a new surge of interest in Bayesian analysis has come from the neural networks community, mainly thanks to the work of MacKay. An introduction to such analysis is provided by the books of Bishop and Neal. More recently, attention has been directed at Gaussian processes, a standard tool in statistics, described in Rasmussen and Williams. We will return to the subject of Gaussian processes in Chapter 6. A Bayesian analysis of the generalisation of Gaussian processes has been performed by Sollich, and by Opper and Vivarelli. Other analyses of generalisation are possible, based on statistical mechanics (see for example the homepage of the group of Wolfgang Kinzel at Wuerzburg) or on the theory of on-line algorithms, introduced by Littlestone and greatly developed by Warmuth.

References not on-line

  • V. Vapnik and A. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16(2):264–280, 1971.
  • D. Pollard. Convergence of Stochastic Processes. Springer, 1984.
  • V. Vapnik. Estimation of Dependences Based on Empirical Data [in Russian]. Nauka, 1979. (English translation: Springer Verlag, 1982).
  • L. G. Valiant. A Theory of the Learnable. Communications of the ACM, 27(11):1134–1142, November 1984.
  • A. Blumer, A. Ehrenfeucht, D. Haussler, and M. K. Warmuth. Learnability and the Vapnik-Chervonenkis dimension. Journal of the ACM, 36(4):929–965, 1989.
  • M. Anthony and N. Biggs. Computational Learning Theory, volume 30 of Cambridge Tracts in Theoretical Computer Science. Cambridge University Press, 1992.
  • L. Devroye, L. Gyorfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition. Number 31 in Applications of mathematics. Springer, 1996.
  • M. Vidyasagar. A Theory of Learning and Generalization. Springer, 1997.
  • M. Anthony and P. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 1999.
  • V. Vapnik. Statistical Learning Theory. Wiley, 1998.
  • V. Vapnik. The Nature of Statistical Learning Theory. Springer Verlag, 1995.
  • V. Vapnik and A. Chervonenkis. Necessary and sufficient conditions for the uniform convergence of means to their expectations. Theory of Probability and its Applications, 26(3):532–553, 1981.
  • V. Vapnik and A. Chervonenkis. The necessary and sufficient conditions for consistency in the empirical risk minimization method. Pattern Recognition and Image Analysis, 1(3):283–305, 1991.
  • R. O. Duda and P. E. Hart. Pattern Classification and Scene Analysis. Wiley, 1973.
  • V. Vapnik and A. Lerner. Pattern recognition using generalized portrait method. Automation and Remote Control, 24, 1963.
  • V. Vapnik and A. Chervonenkis. Theory of Pattern Recognition [in Russian]. Nauka, 1974. (German translation: W. Wapnik & A. Tscherwonenkis, Theorie der Zeichenerkennung, Akademie-Verlag, Berlin, 1979).
  • L. Gurvits. A note on a scale-sensitive dimension of linear bounded functionals in Banach spaces. In Proceedings of Algorithmic Learning Theory, ALT-97, 1997.
  • C. M. Bishop. Neural Networks for Pattern Recognition. Clarendon Press, 1995.
  • R. Neal. Bayesian Learning for Neural Networks. Springer Verlag, 1996.