MSRA-HIT Summer School Course: Supervised and Semi-supervised Learning with Linear Models

References

Lawrence Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Readings in Speech Recognition. 1990.

Probably the definitive tutorial on hidden Markov models. It covers inference, parameter estimation, and applications to speech recognition. If you aren't familiar with HMMs, this is worth reading.
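
As a rough illustration of the inference problem the tutorial covers, here is a minimal sketch of the forward algorithm, which computes the probability of an observation sequence under a discrete HMM. The function name, the list-of-lists parameter layout, and the two-state toy numbers are assumptions made for this example; they are not taken from the tutorial.

```python
def forward_likelihood(initial, transition, emission, observations):
    """Return P(observations) under a discrete HMM via the forward recursion.

    initial[i]       = P(first state is i)
    transition[i][j] = P(next state is j | current state is i)
    emission[i][k]   = P(observing symbol k | state is i)
    """
    if not observations:
        return 1.0  # the empty sequence has probability 1
    n_states = len(initial)
    # alpha[i] = P(observations so far, current state = i)
    alpha = [initial[i] * emission[i][observations[0]] for i in range(n_states)]
    for obs in observations[1:]:
        alpha = [
            sum(alpha[i] * transition[i][j] for i in range(n_states)) * emission[j][obs]
            for j in range(n_states)
        ]
    return sum(alpha)


# Hypothetical two-state, two-symbol HMM used only to exercise the function.
initial = [0.6, 0.4]
transition = [[0.7, 0.3], [0.4, 0.6]]
emission = [[0.9, 0.1], [0.2, 0.8]]
print(forward_likelihood(initial, transition, emission, [0, 1, 0]))
```

The recursion costs time proportional to the sequence length times the square of the number of states, instead of enumerating every state sequence. For long sequences a practical implementation would rescale alpha or work in log space to avoid underflow.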

Kristina Toutanova. Competitive generative models with structure learning for NLP classification tasks. EMNLP 2006.

This paper describes a method for weakening the independence assumptions inherent in naive Bayes in a computationally and statistically efficient way. The method greedily adds edges to a graphical model with the objective of maximizing likelihood. Toutanova shows that, under certain conditions, generative models with the right structure can outperform discriminative models.

Andrew Ng and Michael Jordan. On discriminative vs. generative classifiers: A comparison of naive Bayes and logistic regression. NIPS 2001.

Ng and Jordan examine naive Bayes and logistic regression from a theoretical perspective. Their main focus is on rates of convergence for the two models. They note that the asymptotic error of a logistic regression model is lower than that of the naive Bayes model, but that a naive Bayes model can converge much faster (in the statistical sense) to its own asymptotic error: the number of training examples it needs grows logarithmically in the number of features, rather than linearly as for logistic regression.
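
To get a feel for this comparison, the sketch below trains both classifiers on synthetic data at several training-set sizes and prints their test accuracies. Everything here is an assumption made for illustration: the data generator (three binary features that are independent given the label, so the naive Bayes assumption holds), the helper names, and the training settings. It does not reproduce the paper's experiments.

```python
import math
import random


def make_data(n, rng, p_pos=(0.7, 0.7, 0.7), p_neg=(0.3, 0.3, 0.3)):
    """Draw n (features, label) pairs; features are independent given the label."""
    data = []
    for _ in range(n):
        y = 1 if rng.random() < 0.5 else 0
        probs = p_pos if y == 1 else p_neg
        data.append(([1 if rng.random() < p else 0 for p in probs], y))
    return data


def train_naive_bayes(data, smoothing=1.0):
    """Fit Bernoulli naive Bayes with additive smoothing; return a predict function."""
    d = len(data[0][0])
    count = {0: 0, 1: 0}
    ones = {0: [0] * d, 1: [0] * d}
    for x, y in data:
        count[y] += 1
        for i, v in enumerate(x):
            ones[y][i] += v
    total = len(data)

    def predict(x):
        scores = {}
        for y in (0, 1):
            s = math.log((count[y] + smoothing) / (total + 2 * smoothing))
            for i, v in enumerate(x):
                p = (ones[y][i] + smoothing) / (count[y] + 2 * smoothing)
                s += math.log(p if v else 1.0 - p)
            scores[y] = s
        return 1 if scores[1] > scores[0] else 0

    return predict


def train_logistic_regression(data, epochs=200, step=0.5):
    """Fit logistic regression by batch gradient ascent on the conditional log-likelihood."""
    d = len(data[0][0])
    w, b, n = [0.0] * d, 0.0, len(data)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, y in data:
            z = b + sum(wi * xi for wi, xi in zip(w, x))
            err = y - 1.0 / (1.0 + math.exp(-z))
            grad_b += err
            for i, xi in enumerate(x):
                grad_w[i] += err * xi
        b += step * grad_b / n
        w = [wi + step * g / n for wi, g in zip(w, grad_w)]

    def predict(x):
        return 1 if b + sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0

    return predict


def accuracy(predict, data):
    """Fraction of examples in data that predict labels correctly."""
    return sum(predict(x) == y for x, y in data) / len(data)


rng = random.Random(0)
test_set = make_data(2000, rng)
for n_train in (10, 100, 1000):
    train_set = make_data(n_train, rng)
    nb_acc = accuracy(train_naive_bayes(train_set), test_set)
    lr_acc = accuracy(train_logistic_regression(train_set), test_set)
    print(n_train, round(nb_acc, 3), round(lr_acc, 3))
```

With only three features the gap between the two learners may be small or noisy. Raising the number of features and averaging over several random seeds should give a clearer picture of how quickly each model approaches its own limit.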

Percy Liang and Michael Jordan. An asymptotic analysis of generative, discriminative, and pseudo-likelihood estimators. ICML 2008.

Liang and Jordan examine the asymptotic distribution of risk for several estimators. Their analysis focuses on learning with structured outputs and, in addition to discriminative and generative estimators, covers estimators based on pseudo-likelihood. Among their interesting conclusions is that the asymptotic estimation error of the (ML) generative model is smaller than that of the discriminative model if the model family is well specified, that is, if the generating distribution comes from that model family.