Further Reading: Chapter 5


The theory of optimisation dates back to the work of Fermat, who formulated the result on stationarity for unconstrained problems in the 17th century. The extension to the constrained case was made by Lagrange in 1788, for the case of equality constraints. It was not until 1951 that the theory was generalised to the case of inequality constraints by Kuhn and Tucker, giving rise to the modern theory of convex optimisation. Karush had, however, already described these optimality conditions in his 1939 dissertation, which is why the conditions arising from the Kuhn-Tucker theorem are usually referred to as the Karush-Kuhn-Tucker (KKT) conditions.

In the following years, considerable work was done by Wolfe, Mangasarian, Duran, and others to extend the duality results known for linear programming to the convex programming case (see, for example, the introduction of Luenberger). The spread of computers in the 1960s greatly increased interest in what was then known as mathematical programming, the study of solving problems by (usually linear or quadratic) optimisation methods.

The use of optimisation in machine learning was pioneered by Olvi Mangasarian with his work on linear programming machines. See also the early paper by Smith, the paper by Bennett and Mangasarian for an early use of slack variables in a pattern separation problem, and Bennett et al. for a very nice discussion of linear and quadratic optimisation techniques applied to pattern recognition. Mangasarian took these ideas to their extreme, designing algorithms that can perform data mining on datasets of hundreds of thousands of points (see Chapter 7 for more references). The perceptron algorithm described in Chapter 2 can also be regarded as a simple optimisation procedure, searching for a feasible point given a set of linear constraints specified by the data. Rather than picking an arbitrary point in the feasible region (an ill-posed problem), however, one can choose a specific point satisfying an extremal property of the kind discussed in Chapter 4, such as being maximally distant from the boundaries of the feasible region. This type of consideration will lead to the development of Support Vector Machines in the next chapter. Mangasarian’s early work mainly focused on minimising the 1-norm of the solution w. Notice, finally, that least squares regression (discussed in Chapter 2) was already an application of optimisation ideas in machine learning.

Optimisation theory is a well-developed and quite stable field, and the standard results summarised in this chapter can be found in any good textbook. A particularly readable and comprehensive text on optimisation theory is Bazaraa et al.; the classic books of Luenberger, Fletcher and Mangasarian also provide good introductions. Optimisation theory usually also covers the algorithmic techniques needed to solve such problems in practice; we address this issue in Chapter 7. The important contribution of optimisation theory to the theory of Support Vector Machines, however, lies not on the algorithmic side but in the mathematical characterisation of the solutions that it provides via the KKT conditions, giving a mathematical meaning to the dual variables (introduced in Chapter 2) and to the margin slack vector (introduced in Chapter 4), and more generally in the geometrical intuition it offers for the dual problem.
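
For readers who want the statement at a glance, the KKT conditions for a differentiable convex problem of the form minimise $f(w)$ subject to $g_i(w) \le 0$, $i = 1, \dots, m$, can be written in the following standard textbook form (generic notation, not a quotation from the chapter): under a suitable constraint qualification, $w^*$ is optimal if and only if there exist multipliers $\alpha_i^*$ such that

\[
\begin{aligned}
&\nabla f(w^*) + \sum_{i=1}^{m} \alpha_i^* \, \nabla g_i(w^*) = 0, \\
&g_i(w^*) \le 0, \qquad \alpha_i^* \ge 0, \qquad \alpha_i^* \, g_i(w^*) = 0, \qquad i = 1, \dots, m.
\end{aligned}
\]

The complementary slackness conditions $\alpha_i^* g_i(w^*) = 0$ are what give the dual variables their meaning: a multiplier can be non-zero only for a constraint that is active at the solution.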

Exercise 1, concerning the centre of the smallest ball containing the data, is relevant if we wish to minimise the estimate of the fat-shattering dimension of a given set of points by optimally shifting the origin. This problem was first studied in Schoelkopf et al. Exercise 2, concerning the distance between convex hulls, is discussed in Bennett et al. and provides an efficient way to characterise the maximal margin hyperplane; it is also discussed in Keerthi et al., where it is used to motivate an interesting algorithm (see Chapter 7 for more references).
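
To make the two exercises concrete, here is one standard way of writing the underlying optimisation problems (a sketch in generic notation, not necessarily the exact formulation used in the exercises). The smallest ball containing the points $x_1, \dots, x_\ell$ is obtained by solving

\[
\min_{c,\, r}\; r^2 \quad \text{subject to} \quad \|x_i - c\|^2 \le r^2, \qquad i = 1, \dots, \ell,
\]

while, for two linearly separable classes labelled $y_i = \pm 1$, the distance between their convex hulls is obtained by solving

\[
\min_{\alpha}\; \Bigl\| \sum_{i:\, y_i = +1} \alpha_i x_i \;-\; \sum_{i:\, y_i = -1} \alpha_i x_i \Bigr\|^2
\quad \text{subject to} \quad
\alpha_i \ge 0, \quad \sum_{i:\, y_i = +1} \alpha_i = 1, \quad \sum_{i:\, y_i = -1} \alpha_i = 1.
\]

The maximal margin hyperplane can then be taken as the perpendicular bisector of the segment joining the two closest points of the hulls.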

References not on-line

  • W. Karush. Minima of Functions of Several Variables with Inequalities as Side Constraints. Department of Mathematics, University of Chicago, 1939. (MSc Thesis).
  • H. Kuhn and A. Tucker. Nonlinear programming. Proceedings of the 2nd Berkeley Symposium on Mathematical Statistics and Probability, pages 481–492. University of California Press, 1951.
  • D. Luenberger. Linear and Nonlinear Programming. Addison-Wesley, 1984.
  • O. L. Mangasarian. Linear and nonlinear separation of patterns by linear programming. Operations Research, 13:444–452, 1965.
  • O. L. Mangasarian. Multi-surface method of pattern separation. IEEE Transactions on Information Theory, IT-14:801–807, 1968.
  • O. L. Mangasarian. Mathematical programming in data mining. Data Mining and Knowledge Discovery, 1(2):183–201, 1997.
  • F. W. Smith. Pattern classifier design by linear programming. IEEE Transactions on Computers, C-17:367–372, 1968.
  • M. Bazaraa, D. Sherali, and C. Shetty. Nonlinear Programming: Theory and Algorithms. Wiley-Interscience Series in Discrete Mathematics and Optimization. Wiley, 1992.
  • R. Fletcher. Practical Methods of Optimization. Wiley, 1988.
  • O. L. Mangasarian. Nonlinear Programming. SIAM, 1994.