Further Reading: Chapter 7
The problem of convex optimisation has been extensively studied since the 1950s, and a number of techniques exist to solve it. This chapter did not attempt to be exhaustive, but simply to provide the reader with some techniques that are easy to implement and that exemplify some standard approaches. Discussions of optimisation algorithms can be found in the books by Fletcher, Luenberger, Bazaraa, and Mangasarian.
For the particular case of Support Vector Machines, see the excellent surveys by Chris Burges and by Smola and Schoelkopf, and the PhD dissertation of Osuna. Implementation problems and techniques are discussed by Kaufmann, Joachims, Platt, Osuna et al., Keerthi et al. (1) and Keerthi et al. (2).
The gradient ascent techniques were introduced into the SVM literature by papers on the kernel-Adatron procedure; similar ideas have been used independently by Jaakkola and Haussler. They are also related to SMO with fixed b, and to Hildreth's QP method. Mangasarian and his co-workers recently introduced algorithms that can deal with massive datasets (paper 1, paper 2, paper 3). Albeit with different motivations, the SOR (Successive Overrelaxation) algorithm of Mangasarian and Musicant is equivalent to the stochastic gradient ascent algorithm described in Section 7.2, combined with sample selection heuristics motivated by Platt's approach. Mangasarian and Musicant also give a proof of linear convergence of the algorithm using the link with SOR.
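As a rough sketch of the kind of update these papers study, the following Python fragment implements stochastic gradient ascent on the dual objective with the bias fixed at zero, in the spirit of the kernel-Adatron procedure and of the algorithm of Section 7.2; the function name, the per-coordinate step-size heuristic and the fixed number of epochs are illustrative assumptions, not the published algorithms.

    import numpy as np

    def kernel_adatron_sketch(K, y, C=1.0, epochs=100):
        """Illustrative stochastic gradient ascent on the SVM dual with fixed bias b = 0.

        K is the (n, n) kernel matrix and y the vector of labels in {-1, +1}.
        """
        n = len(y)
        alpha = np.zeros(n)
        for _ in range(epochs):
            for i in np.random.permutation(n):
                # functional margin of example i under the current dual variables
                g = y[i] * np.sum(alpha * y * K[:, i])
                # gradient of the dual objective with respect to alpha_i is 1 - g;
                # 1 / K[i, i] is a common step-size heuristic (assumes K[i, i] > 0)
                alpha[i] = np.clip(alpha[i] + (1.0 - g) / K[i, i], 0.0, C)
        return alpha

Sample selection heuristics of the kind mentioned above amount to choosing which index i to update next, rather than simply sweeping through a random permutation of the data.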
In Remark 7.4 we discussed the case in which it is possible to use a fixed bias; note that Jaakkola and Haussler also use this assumption, and that Mangasarian and Musicant discuss the cases in which it is not too restrictive. Similar approaches have also been used in other papers.
An on-line theory of generalisation for gradient descent algorithms has been developed by Littlestone, Warmuth and others. A discussion of square loss algorithms can be found in Kivinen and Warmuth, while the theory of the hinge loss is developed in Gentile and Warmuth. This theory provides tight bounds on the maximum number of mistakes an on-line algorithm can make in the worst case, and requires surprisingly few assumptions. The bounds can also be translated to the case where the data have been generated by a distribution. One of its best-known consequences is the motivation of multiplicative updating algorithms, and the characterisation of the cases in which they are expected to perform better than standard gradient descent. A multiplicative algorithm for Support Vector Machines has been studied by Cristianini et al.
The elegant SMO algorithm was devised by Platt, and applied to text categorisation problems. Pseudocode for SMO (kindly supplied by John Platt) is available in the Appendix of this book. An extension of SMO, differing in the way it calculates the bias, has been proposed in Keerthi et al. (2) and shown to be faster. Smola has generalised SMO to the case of regression, and the code can be found on the GMD website.
Keerthi et al. proposed a very elegant algorithm for SVMs that does not maximise the margin by minimising the norm of the weight vector. Rather, they note that the vector joining the nearest points of the convex hulls of the positive and negative data uniquely determines the maximal margin hyperplane. This more geometrical view of the problem is discussed, for example, in Bennett et al. and in Schoelkopf. Exercise 3 of Chapter 5 asks the reader to devise an optimisation problem for this criterion, hence motivating this alternative solution strategy for the maximum margin problem. Based on the same approach, Kowalczyk has proved a convergence rate for a new iterative algorithm that can be applied to the hard and soft margin problems, and has provided extensive experimental comparisons with other iterative algorithms. Guyon and Stork give another variant of an iterative algorithm for soft margin optimisation.
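For concreteness, one way to write this criterion as an optimisation problem (and thus one possible answer to Exercise 3 of Chapter 5) is as the nearest-point problem between the two convex hulls; the notation below is ours and covers only the hard margin (separable) case:

\[
\min_{\alpha}\;\Bigl\|\sum_{i:\,y_i=+1}\alpha_i x_i-\sum_{i:\,y_i=-1}\alpha_i x_i\Bigr\|^2
\quad\text{subject to}\quad
\sum_{i:\,y_i=+1}\alpha_i=1,\quad\sum_{i:\,y_i=-1}\alpha_i=1,\quad\alpha_i\ge 0.
\]

The optimal weight vector is proportional to the difference of the two nearest points, and the bias places the hyperplane through the midpoint between them.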
Chunking techniques in Support Vector Machines were already used by Vapnik and Chervonenkis, and were improved, generalised and discussed in a number of papers, among others Osuna et al., Joachims, Platt, Smola and Schoelkopf, and Kaufmann. The work of Osuna and Girosi inspired the subsequent work on data selection, which ultimately led to systems like SMO.
Techniques exist for tuning the parameters automatically. For example, the paper by Cristianini et al. adjusts the kernel parameters, while the ν-SVMs allow the user to set an upper bound on the number of support vectors for classification and regression. The stopping criterion based on the feasibility gap is discussed in Smola, who proposes a criterion similar to that of equation 7.3. Other criteria are discussed in Keerthi et al. (1) and Keerthi et al. (2).
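In general terms (and not necessarily in the exact form of equation 7.3), a criterion of this kind exploits the fact that, by weak duality, any feasible primal objective value upper bounds any feasible dual value, so the current gap bounds the suboptimality of the current solution. One common convention, stated here purely as an illustration, is to stop when

\[
\frac{P(\mathbf{w},b,\xi)-D(\alpha)}{|P(\mathbf{w},b,\xi)|+1}\le\varepsilon,
\]

where P and D are the current primal and dual objective values and \(\varepsilon\) is a user-chosen tolerance.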
The implementation of Gaussian processes is discussed in Gibbs' paper and his PhD thesis. An experimental study of Widrow-Hoff with kernels can also be found here.
Several practical optimisation books discuss general techniques for convex optimisation and give pointers to commercial optimisation packages. Early implementations of SVMs have been based on optimisation packages such as MINOS, LOQO, the MATLAB optimisation toolbox, and others mentioned above. Chapter 1 of the edited book by Smola et al. contains a useful survey of different implementations.
The original reference for the conjugate gradient algorithm is the paper of Hestenes and Stiefel, while Skilling's methods for estimating the quality of the approximation are discussed in Gibbs and MacKay.
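As a reminder of the method itself, here is a minimal, textbook-style conjugate gradient sketch in Python for a symmetric positive definite system Ax = b, of the kind that arises when working with the covariance matrix of a Gaussian process; it is an illustration, not the code discussed in the cited papers.

    import numpy as np

    def conjugate_gradient(A, b, tol=1e-8, max_iter=None):
        """Solve A x = b for symmetric positive definite A by conjugate gradients."""
        n = len(b)
        max_iter = max_iter or n
        x = np.zeros(n)
        r = b - A @ x                      # residual
        p = r.copy()                       # first search direction
        rs_old = r @ r
        for _ in range(max_iter):
            Ap = A @ p
            step = rs_old / (p @ Ap)       # exact step length along p
            x += step * p
            r -= step * Ap
            rs_new = r @ r
            if np.sqrt(rs_new) < tol:      # residual small enough: stop
                break
            p = r + (rs_new / rs_old) * p  # next A-conjugate direction
            rs_old = rs_new
        return x

In exact arithmetic the method terminates in at most n iterations; in practice it is stopped early, which is where estimates of the quality of the approximation, such as those of Skilling, become relevant.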
Online Software for Support Vector Machines and Gaussian Processes
- Code for SMO implemented by Xianping Ge can be obtained from this site.
- The implementation of Support Vector Machines jointly developed by C. Saunders, M. O. Stitson, J. Weston, L. Bottou, B. Schölkopf and A. Smola at Royal Holloway, AT&T and GMD-FIRST is available here, together with its reference manual.
- Thorsten Joachims' implementation of Support Vector Machines, SVM light.
- Steve Gunn's MATLAB SVM toolbox.
- The "Tpros" and "Cpros" packages for Gaussian Processes written by Mark N. Gibbs are available at this page.
- Radford M. Neal's package Software for flexible Bayesian modeling includes programs for neural networks, Gaussian processes and mixture models.
- The gp-map-1 and gp-mc-1 programs for Gaussian Processes written by Carl Edward Rasmussen are available through the DELVE web-site.
- An Octave-based demonstration of Gaussian Processes written by David J. C. MacKay is also available.
- Quadratic Programming packages: R. Vanderbei's LOQO interior point optimizer and Alex Smola's Quadratic Optimizer for SV Pattern Recognition based on the Interior Point Algorithm.
References not on-line
- M. Bazaraa, D. Sherali, and C. Shetty. Nonlinear Programming: Theory and Algorithms. Wiley-Interscience Series in Discrete Mathematics and Optimization. Wiley, 1992.
- R. Fletcher. Practical Methods of Optimization. Wiley, 1988.
- O. L. Mangasarian. Nonlinear Programming. SIAM, 1994.
- D. Luenberger. Linear and Nonlinear Programming. Addison-Wesley, 1984.
- C. Hildreth. A quadratic programming procedure. Naval Research Logistics Quarterly, 4:79–85, 1957.
- J. J. Moré and S. J. Wright. Optimization Software Guide. Frontiers in Applied Mathematics, Vol. 14. SIAM, 1993.
- M. R. Hestenes and E. Stiefel. Methods of conjugate gradients for solving linear systems. Journal of Research of the National Bureau of Standards, 49(6):409–436, 1952.
- R. Kowalczyk. Maximal margin perceptron. In A. J. Smola, P. Bartlett, B. Schoelkopf, and C. Schuurmans, editors, Advances in Large Margin Classifiers. MIT Press, Cambridge, MA, 1999.
- I. Guyon and D. Stork. Linear discriminant and support vector classifiers. In A. J. Smola, P. Bartlett, B. Schoelkopf, and C. Schuurmans, editors, Advances in Large Margin Classifiers. MIT Press, 1999.
- B. A. Murtagh and M. A. Saunders. MINOS 5.4 User's Guide. Technical Report SOL 83.20, Stanford University, 1993.
- L. Kaufmann. Solving the quadratic programming problem arising in support vector classification. In B. Schoelkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods: Support Vector Learning, pages 147–168. MIT Press, Cambridge, MA, 1999.
- B. Schoelkopf, C. J. C. Burges, and A. J. Smola, editors. Advances in Kernel Methods: Support Vector Learning. MIT Press, Cambridge, MA, 1999.
- A. J. Smola, P. Bartlett, B. Schoelkopf, and C. Schuurmans, editors. Advances in Large Margin Classifiers. MIT Press, Cambridge, MA, 1999.