John Shawe-Taylor and Nello Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press 2004.
This is the latest iteration of one of the most popular books on support vector machines. In addition to the linear SVMs covered in this course, it also covers reproducing kernel Hilbert spaces and generalization bounds for SVMs. Definitely worth reading if you want to learn more about SVMs.
Robert Schapire and Yoram Singer. Improved boosting algorithms using confidence-rated predictions. Machine Learning 1999.
This paper introduces boosting algorithms for weak learners that output a yes/no answer together with a real-valued confidence. Most relevant to this course, it also includes proofs of the exponential rate of decrease in training error, together with derivations of how to choose alpha and the base classifiers. If you haven't seen AdaBoost before, this paper is the most complete introduction I know of.
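To make the role of alpha concrete, here is a minimal sketch of plain discrete AdaBoost, not the confidence-rated variants from the paper. It assumes 1-D inputs, labels in {+1, -1}, and threshold stumps as weak learners; the function names (`stump`, `train_adaboost`, `predict`) are made up for illustration. Each round uses alpha = 0.5 * ln((1 - err) / err), the standard choice when the weak learner outputs only +1 or -1.

```python
import math

def stump(x, thr, sign):
    """Weak learner: predict `sign` when x > thr, otherwise -sign."""
    return sign if x > thr else -sign

def train_adaboost(xs, ys, rounds):
    """Discrete AdaBoost over threshold stumps; labels must be +1 or -1."""
    n = len(xs)
    weights = [1.0 / n] * n
    pts = sorted(set(xs))
    # Candidate thresholds: one below all points, plus midpoints between neighbours.
    thresholds = [pts[0] - 1.0] + [(a + b) / 2 for a, b in zip(pts, pts[1:])]
    ensemble = []
    for _ in range(rounds):
        # Pick the stump with the smallest weighted training error.
        best = None
        for thr in thresholds:
            for sign in (1, -1):
                err = sum(w for x, y, w in zip(xs, ys, weights)
                          if stump(x, thr, sign) != y)
                if best is None or err < best[0]:
                    best = (err, thr, sign)
        err, thr, sign = best
        err = min(max(err, 1e-12), 1 - 1e-12)  # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, thr, sign))
        # Up-weight mistakes, down-weight correct examples, then renormalise.
        weights = [w * math.exp(-alpha * y * stump(x, thr, sign))
                   for x, y, w in zip(xs, ys, weights)]
        z = sum(weights)
        weights = [w / z for w in weights]
    return ensemble

def predict(ensemble, x):
    """Sign of the alpha-weighted vote of all stumps."""
    score = sum(alpha * stump(x, thr, sign) for alpha, thr, sign in ensemble)
    return 1 if score >= 0 else -1
```

On a toy set such as xs = [1, 2, 3, 4, 5, 6] with ys = [1, 1, -1, -1, 1, 1], no single stump is error-free, but the weighted vote of three stumps should fit all six points.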
Shai Shalev-Shwartz, Yoram Singer, and Nathan Srebro. Pegasos: Primal estimated sub-gradient solver for SVM. ICML 2007.
Until just a few years ago, SVM optimization was considered hard enough that only a few people in NLP would write their own SVM solvers. Instead, NLP researchers would download pre-packaged solvers, massage their data as best they could, and use those. That has changed in the past few years, and while this isn't the first paper to discuss stochastic gradient-style algorithms for SVMs, I think it is the most thorough and clearly written one. In addition to demonstrating impressive empirical results, the authors also give an easy-to-understand proof of an O((log T)/T) convergence bound.
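To show how little code a stochastic sub-gradient SVM solver needs, here is a minimal sketch in the spirit of Pegasos. It assumes a linear SVM with no separate bias term (a constant feature can be appended instead), one randomly chosen example per step, and a step size of 1/(lam * t). It omits the optional projection step, and the function name and signature are made up for illustration.

```python
import random

def pegasos(examples, lam, num_iters, seed=0):
    """Stochastic sub-gradient descent on lam/2 * ||w||^2 + average hinge loss.

    examples: list of (features, label) pairs, with label in {+1, -1}.
    """
    rng = random.Random(seed)
    dim = len(examples[0][0])
    w = [0.0] * dim
    for t in range(1, num_iters + 1):
        x, y = rng.choice(examples)          # sample one training example
        eta = 1.0 / (lam * t)                # decaying step size
        margin = y * sum(wi * xi for wi, xi in zip(w, x))
        # Gradient of the regulariser shrinks w on every step.
        w = [(1.0 - eta * lam) * wi for wi in w]
        # The hinge term only contributes when the margin is violated.
        if margin < 1.0:
            w = [wi + eta * y * xi for wi, xi in zip(w, x)]
    return w
```

The margin is computed from the weights before the shrinking step, so the violation test uses the current iterate.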
Ralf Herbrich, Klaus Obermayer, and Thore Graepel. Large margin rank boundaries for ordinal regression. Advances in Large Margin Classifiers 1999.
Yunbo Cao, Jun Xu, Tie-Yan Liu, Hang Li, Yalou Huang, and Hsiao-Wuen Hon. Adapting Ranking SVM to Document Retrieval. SIGIR 2006.
One of the hottest applications for machine learning right now is "learning to rank", where the objective is to learn a function that orders a set of documents by their relevance to a given query. The first of these papers discusses an adaptation of support vector machines for ordinal regression that minimizes a loss defined over pairs of items, penalizing pairs that are ranked in the wrong order. The second shows how to adapt this loss to obtain better results for information retrieval.
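A pair-based loss of this kind can be sketched as follows. This is a generic pairwise hinge loss for a linear scoring function and does not include the retrieval-specific adjustments from the second paper. The function name, the averaging over pairs, and the margin of 1 are assumptions made for illustration.

```python
def pairwise_hinge_loss(w, docs):
    """Average hinge loss over ordered document pairs for a single query.

    w:    weight vector of a linear scoring function, score(x) = w . x
    docs: list of (features, relevance) pairs; higher relevance should rank higher.
    """
    total, pairs = 0.0, 0
    for xi, ri in docs:
        for xj, rj in docs:
            if ri > rj:
                # Score difference between the more and the less relevant document.
                diff = sum(wk * (a - b) for wk, a, b in zip(w, xi, xj))
                # No penalty once the better document wins by a margin of at least 1.
                total += max(0.0, 1.0 - diff)
                pairs += 1
    return total / pairs if pairs else 0.0
```

Because the loss depends only on feature differences within a query, a classifier trained on those difference vectors yields a ranking function, which is why a standard SVM solver can be reused.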