Further Reading: Chapter 6


Support Vector Machines are a very specific class of algorithms, characterised by the use of kernels, the absence of local minima, the sparseness of the solution and the capacity control obtained by acting on the margin, or on other 'dimension independent' quantities such as the number of support vectors. They were invented by Vladimir Vapnik and his co-workers, and first introduced in a paper at the COLT 1992 conference. All of these features, however, were already present and had been used in machine learning since the 1960s: large margin hyperplanes in the input space were discussed for example by Duda and Hart, Cover, Vapnik et al., and in several statistical mechanics papers (for example Anlauf and Biehl); the use of kernels was proposed by Aronszajn, Wahba, Poggio, and others, but it was the paper by Aizerman et al. in 1964 that introduced the geometrical interpretation of kernels as inner products in a feature space. Similar optimisation techniques were used in pattern recognition by Mangasarian, and the sparseness of the solution had also already been discussed. See also Hassoun for related early work. The use of slack variables to overcome the problem of noise and non-separability was also introduced in the 1960s by Smith and improved by Bennett and Mangasarian.

However, it was not until 1992 that all of these features were put together to form the maximal margin classifier, the basic Support Vector Machine, and not until 1995 that the soft margin version was introduced by Cortes and Vapnik: it is surprising how naturally and elegantly all the pieces fit together and complement each other. The papers by Shawe-Taylor et al. and by Bartlett gave the first rigorous statistical bounds on the generalisation of hard margin SVMs, while the paper by Shawe-Taylor and Cristianini gives similar bounds for the soft margin algorithms and for the regression case.

After their introduction, an increasing number of researchers have worked on both the algorithmic and theoretical analysis of these systems, creating in just a few years what is effectively a new research direction in its own right, and merging concepts from disciplines as distant as statistics, functional analysis and optimisation with machine learning. The soft margin classifier was introduced a few years later by Cortes and Vapnik, and in 1995 the algorithm was extended to the regression case.

The two recent books written by Vapnik provide a very extensive theoretical background for the field and develop the concept of a Support Vector Machine.

Since most recent advances in kernels, learning theory, and implementation are discussed at the ends of the relevant chapters, we only give a brief overview of some of the improvements recently obtained from the point of view of the overall algorithm. Work has been done on generalisations of the method and on extensions to the multiclass case (Weston and Watkins; Platt, Cristianini and Shawe-Taylor; Vapnik). More general regression scenarios have also been studied: Smola, Schoelkopf and Mueller discussed a very general class of cost functions, and the use of ridge regression in feature space was considered by Saunders et al. and by Smola and Schoelkopf.

Ridge regression is a special case of regularisation networks. The concept of regularisation was introduced by Tikhonov, and applied to learning in the form of regularisation networks by Girosi, Jones and Poggio. The relation between regularisation networks and Support Vector Machines has been explored by a number of authors (Girosi; Wahba; Smola, Schoelkopf and Mueller; see also Evgeniou, Pontil and Poggio).
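
As a quick reminder of why ridge regression fits this picture, the following sketch writes it as the simplest Tikhonov regularisation problem and gives its feature space (kernel) solution. The notation is our own, with regularisation parameter \lambda and kernel matrix K, and may differ from that of the cited papers.

    % Ridge regression as the simplest Tikhonov regularisation problem:
    % trade squared loss on the training data against the norm of w.
    \min_{w}\ \sum_{i=1}^{\ell} \bigl( y_i - \langle w, \phi(x_i) \rangle \bigr)^2 \;+\; \lambda \, \| w \|^2

    % In feature space the minimiser is w = \sum_i \alpha_i \phi(x_i), with
    \alpha \;=\; (K + \lambda I_{\ell})^{-1} y, \qquad K_{ij} = K(x_i, x_j),

    % so that predictions need only kernel evaluations:
    f(x) \;=\; \sum_{i=1}^{\ell} \alpha_i \, K(x_i, x).

The closed-form dual solution is what makes the feature space version of the algorithm practical: only the kernel matrix is ever needed, never the feature map itself.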

The ν Support Vector algorithm for classification and regression described in Remark 6.13 was introduced and developed by Smola, Schoelkopf, Williamson and Bartlett. Other adaptations of the basic approach have been used for density estimation, transduction, Bayes Point Estimation, and ordinal regression. Rifkin et al. show that for some degenerate training sets the soft margin gives a trivial solution.
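
For readers who want the formulation at a glance, the following is one common statement of the ν-classification primal, written in our own notation, which may differ in detail from that of Remark 6.13 and of the original papers.

    % \nu-classification primal: the parameter \nu \in (0,1] replaces C,
    % upper bounds the fraction of margin errors and lower bounds the
    % fraction of support vectors.
    \min_{w,\, b,\, \xi,\, \rho}\ \ \frac{1}{2}\|w\|^2 \;-\; \nu\rho \;+\; \frac{1}{\ell}\sum_{i=1}^{\ell}\xi_i

    \text{subject to}\quad y_i\bigl(\langle w, \phi(x_i)\rangle + b\bigr) \;\ge\; \rho - \xi_i,
    \qquad \xi_i \ge 0, \qquad \rho \ge 0.

The attraction of this parametrisation is that ν has a direct interpretation in terms of the fraction of margin errors and of support vectors, whereas the usual parameter C has no such immediate meaning.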

Theoretical advances that do not fit within the framework of Chapter 4 include the analyses of generalisation given in the article by Dietrich, Opper and Sompolinsky, which provides a statistical mechanics analysis of SVMs; the papers by Jaakkola and Haussler, Vapnik and Chapelle, Weston and Herbrich, Wahba, Lin and Zhang, and Opper and Winther, which provide a cross-validation analysis of the expected error; and the book by Vapnik, which gives an expected error bound in terms of the margin and the radius of the smallest ball containing the essential support vectors.

Extensions of the SVM concept have been made by several authors, for example Mangasarian's generalised SVMs. Particularly interesting is the development of a system, called the Bayes Point Machine, that enforces another inductive principle, given by Bayesian generalisation theory. Although it loses the sparseness of the solution, this system exhibits excellent performance, and illustrates the important point that this class of algorithms is not limited to the use of margin bounds. Another example of a system similar to SVMs that does not enforce a margin bound is given by Gaussian processes.

More recently, a number of practical applications of SVMs have been reported, in fields as diverse as bioinformatics, computational linguistics and computer vision (some of which are reported in Chapter 8). Many of these recent advances are reported in the collections edited by Schoelkopf et al. and by Smola et al., and in the papers by Burges, by Smola and Schoelkopf, and by Evgeniou, Pontil and Poggio.

Finally, the PhD dissertations of Cortes, Osuna, Schoelkopf and Smola provide a valuable first-hand source of research topics, including work on the feasibility gap.

Gaussian processes are surveyed in the paper by Williams, and in the dissertations of Rasmussen and Gibbs. Extensions of Gaussian processes to the classification case have also been undertaken, but fall beyond the scope of this book.

References not on-line

  • R.O. Duda and P.E. Hart. Pattern Classification and Scene Analysis. Wiley, 1973.
  • T.M. Cover. Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition. IEEE Transactions on Electronic Computers, 14:326–334, 1965.
  • V. Vapnik and A. Lerner. Pattern recognition using generalized portrait method. Automation and Remote Control, 24, 1963.
  • V. Vapnik and A. Chervonenkis. A note on one class of perceptrons. Automation and Remote Control, 25, 1964.
  • J.K. Anlauf and M. Biehl. The AdaTron: an adaptive perceptron algorithm. Europhysics Letters, 10:687–692, 1989.
  • N. Aronszajn. Theory of reproducing kernels. Transactions of the American Mathematical Society, 68:337–404, 1950.
  • M. Aizerman, E. Braverman, and L. Rozonoer. Theoretical foundations of the potential function method in pattern recognition learning. Automation and Remote Control, 25:821–837, 1964.
  • O.L. Mangasarian. Linear and nonlinear separation of patterns by linear programming. Operations Research, 13:444–452, 1965.
  • F.W. Smith. Pattern classifier design by linear programming. IEEE Transactions on Computers, C-17:367–372, 1968.
  • C. Cortes and V. Vapnik. Support vector networks. Machine Learning, 20:273–297, 1995.
  • V. Vapnik. The Nature of Statistical Learning Theory. Springer Verlag, 1995.
  • V. Vapnik. Statistical Learning Theory. Wiley, 1998.
  • A.N. Tikhonov and V.Y. Arsenin. Solutions of Ill-posed Problems. W.H. Winston, 1977.
  • B. Schoelkopf, C.J.C. Burges, and A.J. Smola, editors. Advances in Kernel Methods – Support Vector Learning. MIT Press, Cambridge, MA, 1999.
  • A.J. Smola, P. Bartlett, B. Schoelkopf, and C. Schuurmans, editors. Advances in Large Margin Classifiers. MIT Press, Cambridge, MA, 1999.