36-708: The ABCDE of Statistical Methods in Machine Learning
(aka: developing judgment for complex stat-ml methods)
Class location and time: DH 1211 (MW 10:30-11:50am)
Office hours locations and times: Aaditya: BH 132H (MW 12-12:30pm), Pratik: GHC 8106 (T 2-3pm)
Course details:
Scribes:
Homeworks:
Lectures:
- L01 (Jan 13): Introduction
- L02 (Jan 15): Basics of supervised learning: regression, classification
- L-- (Jan 20): No class (MLK day)
- L03 (Jan 22): Nearest-neighbor methods: k-nn regression and classification
- Scribe note (Yunhan Wen)
- Lectures on the nearest neighbor method (Biau, Devroye, 2015)
- Prototype methods and nearest-neighbors (Hastie, Tibshirani, Friedman, 2017) [Elements of statistical learning, chapter 13]
- Nearest neighbor pattern classification (Cover, Hart, 1967)
- Kernel and nearest-neighbor estimation of a conditional quantile (Bhattacharya, Gangopadhyay, 1990)
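As a rough companion to L03, a minimal numpy sketch of k-nearest-neighbor regression with plain Euclidean distances and unweighted averaging; all function and variable names are illustrative, not taken from the lecture or scribe notes.

```python
# Minimal k-NN regression sketch (companion to L03); names are illustrative.
import numpy as np

def knn_regress(X_train, y_train, X_query, k=5):
    """Predict at each query point by averaging the labels of its k nearest training points."""
    preds = np.empty(len(X_query))
    for i, x in enumerate(X_query):
        dists = np.linalg.norm(X_train - x, axis=1)   # Euclidean distances to training points
        nn_idx = np.argsort(dists)[:k]                # indices of the k nearest neighbors
        preds[i] = y_train[nn_idx].mean()             # local average
    return preds

# toy usage: noisy sine curve
rng = np.random.default_rng(0)
X = rng.uniform(0, 2 * np.pi, size=(200, 1))
y = np.sin(X).ravel() + 0.1 * rng.standard_normal(200)
X_grid = np.linspace(0, 2 * np.pi, 50).reshape(-1, 1)
print(knn_regress(X, y, X_grid, k=10)[:5])
```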
- L04 (Jan 27): Predictive inference: conformal prediction
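A minimal sketch of split conformal prediction intervals for regression, in the spirit of L04; the least-squares base learner, the `fit_predict` interface, and the miscoverage level alpha=0.1 are illustrative assumptions, not the course's code.

```python
# Split conformal prediction sketch (companion to L04); names are illustrative.
import numpy as np

def split_conformal_interval(X, y, fit_predict, X_test, alpha=0.1, rng=None):
    """Fit on one half of the data, calibrate absolute residuals on the other half,
    and return a (1 - alpha) prediction interval at each test point."""
    rng = rng or np.random.default_rng(0)
    n = len(y)
    idx = rng.permutation(n)
    train, calib = idx[: n // 2], idx[n // 2 :]
    predict = fit_predict(X[train], y[train])                 # returns a prediction function
    scores = np.abs(y[calib] - predict(X[calib]))             # conformity scores
    # quantile level with the finite-sample conformal correction
    level = min(1.0, np.ceil((1 - alpha) * (len(calib) + 1)) / len(calib))
    q = np.quantile(scores, level)
    mu = predict(X_test)
    return mu - q, mu + q

# toy usage with a least-squares line as the base learner
def fit_ls(X, y):
    beta, *_ = np.linalg.lstsq(np.c_[np.ones(len(X)), X], y, rcond=None)
    return lambda Xn: np.c_[np.ones(len(Xn)), Xn] @ beta

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 1)); y = 2 * X.ravel() + rng.normal(size=300)
lo, hi = split_conformal_interval(X, y, fit_ls, X[:5], alpha=0.1)
print(np.c_[lo, hi])
```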
- L05 (Jan 29): Ensemble methods: boosting (game-theoretic perspective)
- Scribe note (Sasha Podkopaev)
- Boosting (Mohri, Rostamizadeh, Talwalkar, 2018) [Foundations of machine learning, chapter 07]
- The strength of weak learnability (Schapire, 1990)
- Boosting a weak learning algorithm by majority (Freund, 1995)
- The weighted majority algorithm (Littlestone, Warmuth, 1992)
- A decision-theoretic generalization of on-line learning and an application to boosting (Freund, Schapire, 1997)
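A bare-bones AdaBoost-with-decision-stumps sketch as a companion to the boosting lectures (L05-L08), under the y in {-1, +1} label convention; the exhaustive stump search and all names are illustrative, not the course's implementation.

```python
# AdaBoost with decision stumps, minimal sketch (companion to L05-L08).
import numpy as np

def fit_stump(X, y, w):
    """Exhaustively pick the weighted-error-minimizing threshold stump."""
    best = (np.inf, 0, 0.0, 1)                       # (error, feature, threshold, sign)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            for s in (1, -1):
                pred = s * np.sign(X[:, j] - t + 1e-12)
                err = w[pred != y].sum()
                if err < best[0]:
                    best = (err, j, t, s)
    return best

def adaboost(X, y, n_rounds=20):
    n = len(y)
    w = np.full(n, 1.0 / n)                          # uniform initial weights
    stumps, alphas = [], []
    for _ in range(n_rounds):
        err, j, t, s = fit_stump(X, y, w)
        err = np.clip(err, 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)        # weak learner's vote weight
        pred = s * np.sign(X[:, j] - t + 1e-12)
        w *= np.exp(-alpha * y * pred)               # upweight the mistakes
        w /= w.sum()
        stumps.append((j, t, s)); alphas.append(alpha)
    def predict(Xn):
        agg = sum(a * s * np.sign(Xn[:, j] - t + 1e-12)
                  for a, (j, t, s) in zip(alphas, stumps))
        return np.sign(agg)
    return predict

# toy usage: two Gaussian blobs
rng = np.random.default_rng(2)
X = np.r_[rng.normal(-1, 1, (100, 2)), rng.normal(1, 1, (100, 2))]
y = np.r_[-np.ones(100), np.ones(100)]
clf = adaboost(X, y, n_rounds=30)
print("training accuracy:", (clf(X) == y).mean())
```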
- L06 (Feb 03): Ensemble methods: boosting (statistical perspective)
- L07 (Feb 05): Ensemble methods: boosting (computational considerations, applications), guest lecture by Allie
- L08 (Feb 10): Ensemble methods: boosting (generalization)
- L09 (Feb 12): Quiz 1
- Topics: basics (supervised learning), prototype methods (nearest-neighbor methods), predictive inference (conformal prediction), ensemble methods (boosting)
- L10 (Feb 17): Ensemble methods: bagging, random forests
- Scribe note (Andrew Warren)
- Random forests (Hastie, Tibshirani, Friedman, 2017) [Elements of statistical learning, chapter 15]
- Tree-based methods (James, Witten, Hastie, Tibshirani, 2017) [An introduction to statistical learning, chapter 08]
- Bagging predictors (Breiman, 1996)
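A minimal sketch of bagging (bootstrap aggregation) as a companion to L10; a quadratic least-squares fit stands in for the trees used in random forests, and all names and defaults are illustrative assumptions.

```python
# Bagging by bootstrap averaging, minimal sketch (companion to L10).
import numpy as np

def bagged_predict(X, y, X_test, fit_predict, n_boot=50, rng=None):
    """Fit the base learner on n_boot bootstrap resamples and average their predictions."""
    rng = rng or np.random.default_rng(0)
    n = len(y)
    preds = np.zeros((n_boot, len(X_test)))
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)             # bootstrap sample (with replacement)
        preds[b] = fit_predict(X[idx], y[idx])(X_test)
    return preds.mean(axis=0)                        # aggregated (averaged) prediction

# toy usage with a quadratic least-squares base learner standing in for a tree
def fit_quad(X, y):
    Z = np.c_[np.ones(len(X)), X, X ** 2]
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return lambda Xn: np.c_[np.ones(len(Xn)), Xn, Xn ** 2] @ beta

rng = np.random.default_rng(3)
X = rng.uniform(-2, 2, size=(200, 1))
y = X.ravel() ** 2 + 0.3 * rng.standard_normal(200)
print(bagged_predict(X, y, X[:5], fit_quad, n_boot=100))
```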
- L11 (Feb 19): Variable importance: random forests case study
- L12 (Feb 24): Datapoint importance: Shapley values
- Scribe note (Zeyu Tang)
- Notes on the n-person game -- II: The value of an n-person game (Shapley, 1951)
- A unified approach to interpreting model predictions (Lundberg, Lee, 2017)
- Data Shapley: Equitable valuation of data for machine learning (Ghorbani, Zou, 2019)
- Shapley values (Molnar, 2020) [Interpretable machine learning, chapter 05]
- Problems with Shapley-value-based explanations as feature importance measures (Kumar, Venkatasubramanian, Scheidegger, Friedler, 2020)
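A Monte Carlo sketch of Shapley values for datapoint importance as a companion to L12, loosely in the spirit of Data Shapley (Ghorbani, Zou, 2019); the 1-nearest-neighbor validation-accuracy utility is an illustrative stand-in for whatever value function the lecture used.

```python
# Monte Carlo Shapley values for datapoint importance, minimal sketch (companion to L12).
import numpy as np

def utility(train_idx, X, y, X_val, y_val):
    """Validation accuracy of a 1-nearest-neighbor classifier fit on a subset of the data."""
    if len(train_idx) == 0:
        return 0.0
    d = np.linalg.norm(X_val[:, None, :] - X[train_idx][None, :, :], axis=2)
    preds = y[train_idx][d.argmin(axis=1)]
    return (preds == y_val).mean()

def mc_shapley(X, y, X_val, y_val, n_perms=50, rng=None):
    """Average each point's marginal contribution to the utility over random permutations."""
    rng = rng or np.random.default_rng(0)
    n = len(y)
    phi = np.zeros(n)
    for _ in range(n_perms):
        perm = rng.permutation(n)
        prev = 0.0
        for pos in range(n):
            cur = utility(perm[: pos + 1], X, y, X_val, y_val)
            phi[perm[pos]] += cur - prev                 # marginal contribution of this point
            prev = cur
    return phi / n_perms

# toy usage: two labeled Gaussian blobs
rng = np.random.default_rng(4)
X = np.r_[rng.normal(-1, 1, (30, 2)), rng.normal(1, 1, (30, 2))]
y = np.r_[np.zeros(30), np.ones(30)]
Xv = np.r_[rng.normal(-1, 1, (20, 2)), rng.normal(1, 1, (20, 2))]
yv = np.r_[np.zeros(20), np.ones(20)]
print(mc_shapley(X, y, Xv, yv, n_perms=20)[:5])
```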
- L13 (Feb 26): Ensemble methods: stacking
- L14 (Mar 02): Predictive inference: jackknife+
- L15 (Mar 04): Predictive inference: leave-one-out
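A minimal sketch of the jackknife+ prediction interval as a companion to L14-L15, built from leave-one-out residuals and leave-one-out predictions; the least-squares base learner and the helper names are illustrative assumptions.

```python
# Jackknife+ prediction interval, minimal sketch (companion to L14-L15).
import numpy as np

def fit_ls(X, y):
    beta, *_ = np.linalg.lstsq(np.c_[np.ones(len(X)), X], y, rcond=None)
    return lambda Xn: np.c_[np.ones(len(Xn)), Xn] @ beta

def jackknife_plus(X, y, x_test, alpha=0.1, fit=fit_ls):
    """Combine leave-one-out residuals R_i and leave-one-out predictions at x_test
    into the jackknife+ interval via order statistics."""
    n = len(y)
    lo_terms, hi_terms = np.empty(n), np.empty(n)
    for i in range(n):
        mask = np.arange(n) != i
        mu_i = fit(X[mask], y[mask])                       # model trained without point i
        r_i = abs(y[i] - mu_i(X[i : i + 1])[0])            # leave-one-out residual
        m = mu_i(x_test.reshape(1, -1))[0]
        lo_terms[i], hi_terms[i] = m - r_i, m + r_i
    k_lo = int(np.floor(alpha * (n + 1)))                  # lower order statistic
    k_hi = int(np.ceil((1 - alpha) * (n + 1)))             # upper order statistic
    return np.sort(lo_terms)[max(k_lo - 1, 0)], np.sort(hi_terms)[min(k_hi - 1, n - 1)]

# toy usage
rng = np.random.default_rng(5)
X = rng.normal(size=(100, 1)); y = 3 * X.ravel() + rng.normal(size=100)
print(jackknife_plus(X, y, np.array([0.5]), alpha=0.1))
```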
- L-- (Mar 09): No class (spring break)
- L-- (Mar 11): No class (spring break)
- L16 (Mar 16): No class (preparation for the COVID-19 move online)
- L17 (Mar 18): Mid-class recap: (methods/rows) k-nearest neighbors, boosting, bagging, random forests, stacking; (aspects/columns) algorithms, bias-variance, computation, conformal; (practical) data aspects, explainability, interpretability
- L18 (Mar 23): Quiz 2
- Topics: variable importance (random forest), data importance (Shapley values), ensemble methods (stacking), predictive inference (jackknife+, leave-one-out)
- L19 (Mar 25): Kernel learning: basics (RKHS intro)
- L20 (Mar 30): Kernel learning: basics (RKHS equivalences)
- L21 (Apr 01): Kernel learning: basics (universal/characteristic kernel)
- L22 (Apr 06): Kernel learning: kernel regression and classification (kernel ridge regression, kernel SVM, kernel logistic regression)
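A minimal kernel ridge regression sketch with an RBF kernel as a companion to L22; the bandwidth, the lambda*n regularization scaling, and all names are illustrative assumptions.

```python
# Kernel ridge regression with an RBF kernel, minimal sketch (companion to L22).
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    d2 = (np.sum(A ** 2, axis=1)[:, None] + np.sum(B ** 2, axis=1)[None, :]
          - 2 * A @ B.T)                                  # pairwise squared distances
    return np.exp(-gamma * d2)

def kernel_ridge_fit(X, y, gamma=1.0, lam=1e-2):
    """Solve (K + lam * n * I) alpha = y; predict with f(x) = sum_i alpha_i k(x, x_i)."""
    n = len(y)
    K = rbf_kernel(X, X, gamma)
    alpha = np.linalg.solve(K + lam * n * np.eye(n), y)
    return lambda Xn: rbf_kernel(Xn, X, gamma) @ alpha

# toy usage: fit a noisy sine curve
rng = np.random.default_rng(9)
X = rng.uniform(0, 2 * np.pi, size=(150, 1))
y = np.sin(X).ravel() + 0.1 * rng.standard_normal(150)
f = kernel_ridge_fit(X, y, gamma=2.0, lam=1e-3)
print(f(np.array([[0.0], [np.pi / 2], [np.pi]])))   # should be roughly 0, 1, 0
```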
- L23 (Apr 08): Unsupervised learning: clustering (kernel hierarchical clustering, k-means clustering)
- L24 (Apr 13): Unsupervised learning: dimension reduction (PCA, kernel PCA)
- Scribe note (Misha Khodak)
- Unsupervised learning (Hastie, Tibshirani, Friedman, 2017) [Elements of statistical learning, chapter 14]
- Resampling methods (James, Witten, Hastie, Tibshirani, 2017) [An introduction to statistical learning, chapter 05]
- Kernel tricks and nonlinear dimensionality reduction via RBF kernel PCA (Raschka, 2014)
- Kernel Principal Component Analysis and its Applications in Face Recognition and Active Shape Models (Wang, 2012)
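A minimal RBF kernel PCA sketch as a companion to L24: double-center the Gram matrix, eigendecompose, and read off the projection scores. The bandwidth and all names are illustrative assumptions.

```python
# RBF kernel PCA, minimal sketch (companion to L24).
import numpy as np

def rbf_kernel(X, gamma=1.0):
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T          # pairwise squared distances
    return np.exp(-gamma * d2)

def kernel_pca(X, n_components=2, gamma=1.0):
    n = len(X)
    K = rbf_kernel(X, gamma)
    J = np.eye(n) - np.ones((n, n)) / n
    Kc = J @ K @ J                                        # double-centered Gram matrix
    vals, vecs = np.linalg.eigh(Kc)                       # eigenvalues in ascending order
    vals, vecs = vals[::-1], vecs[:, ::-1]                # sort descending
    # projection scores: Kc v_k / sqrt(lambda_k) = sqrt(lambda_k) v_k
    return Kc @ vecs[:, :n_components] / np.sqrt(np.maximum(vals[:n_components], 1e-12))

# toy usage: two concentric circles, which linear PCA cannot separate
rng = np.random.default_rng(6)
theta = rng.uniform(0, 2 * np.pi, 200)
r = np.r_[np.ones(100), 3 * np.ones(100)] + 0.1 * rng.standard_normal(200)
X = np.c_[r * np.cos(theta), r * np.sin(theta)]
print(kernel_pca(X, n_components=2, gamma=2.0)[:3])
```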
- L25 (Apr 15): Unsupervised learning: dimension reduction (stochastic PCA, deep PCA, autoencoders)
- Scribe note (not available, sorry!)
- Unsupervised learning (Hastie, Tibshirani, Friedman, 2017) [Elements of statistical learning, chapter 14]
- Unsupervised learning (James, Witten, Hastie, Tibshirani, 2017) [An introduction to statistical learning, chapter 10]
- The Fast Convergence of Incremental PCA (Balsubramani, Dasgupta, Freund, 2015)
- Extracting and Composing Robust Features with Denoising Autoencoders (Vincent, Larochelle, Bengio, Manzagol, 2008)
- Autoencoders, Unsupervised Learning, and Deep architectures (Baldi, 2012)
- Stacked denoising autoencoders: learning useful representations in a deep network with a local denoising criterion (Vincent, Larochelle, Lajoie, Manzagol, 2010)
- A PCA-like autoencoder (Ladjal, Newson, Pham, 2019)
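A minimal sketch of stochastic (incremental) PCA via an Oja-style update as a companion to L25, estimating the top principal direction one observation at a time; the step-size schedule and names are illustrative assumptions, and the batch eigenvector is computed only for comparison.

```python
# Stochastic (incremental) PCA via Oja's rule, minimal sketch (companion to L25).
import numpy as np

def oja_top_component(X, lr0=0.1, n_epochs=5, rng=None):
    """Streaming update w <- normalize(w + eta_t * x (x . w)) on centered data."""
    rng = rng or np.random.default_rng(0)
    Xc = X - X.mean(axis=0)                               # center once for simplicity
    w = rng.standard_normal(X.shape[1])
    w /= np.linalg.norm(w)
    t = 0
    for _ in range(n_epochs):
        for i in rng.permutation(len(Xc)):
            t += 1
            eta = lr0 / np.sqrt(t)                        # decaying step size
            w += eta * Xc[i] * (Xc[i] @ w)                # Oja / Hebbian-style update
            w /= np.linalg.norm(w)
    return w

# toy usage: compare against the batch top eigenvector
rng = np.random.default_rng(7)
X = rng.multivariate_normal([0, 0], [[3.0, 1.0], [1.0, 1.0]], size=2000)
w_stream = oja_top_component(X)
w_batch = np.linalg.eigh(np.cov(X.T))[1][:, -1]
print("alignment |cos angle|:", abs(w_stream @ w_batch))
```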
- L26 (Apr 17): Guest lecture by Lucas Mentch
- L27 (Apr 20): Neural networks
- Scribe note (Jenn Williams)
- Neural networks (Hastie, Tibshirani, Friedman, 2017) [Elements of statistical learning, chapter 11]
- Breaking the curse of dimensionality with convex neural networks (Bach, 2017)
- Universal approximation bounds for superpositions of a sigmoidal function (Barron, 1993)
- Approximation by superpositions of a sigmoidal function (Cybenko, 1989)
- Approximations of continuous functionals by neural networks with application to dynamic systems (Chen, Chen, 1993)
- Universal approximation to nonlinear operators by neural networks with arbitrary activation functions and its application to dynamical systems (Chen, Chen, 1995)
- Gradient descent only converges to minimizers (Lee, Simchowitz, Jordan, Recht, 2016)
- First-order methods almost always avoid saddle points: the case of vanishing step-sizes (Panageas, Piliouras, Wang, 2019)
- Benefits of depth in neural networks (Telgarsky, 2016)
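A minimal sketch of fitting a one-hidden-layer sigmoid network by full-batch gradient descent, as a companion to L27's superposition-of-sigmoids (Cybenko/Barron) theme; the architecture, step sizes, and names are illustrative assumptions, not the lecture's code.

```python
# One-hidden-layer sigmoid network fit by gradient descent, minimal sketch (companion to L27).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -60.0, 60.0)))   # clipped for numerical safety

def fit_one_hidden_layer(x, y, width=32, lr=0.05, n_steps=3000, rng=None):
    """Least-squares fit of f(x) = sum_k c_k * sigmoid(a_k x + b_k) + c0 by full-batch GD."""
    rng = rng or np.random.default_rng(0)
    a = rng.standard_normal(width); b = rng.standard_normal(width)
    c = 0.1 * rng.standard_normal(width); c0 = 0.0
    n = len(x)
    for _ in range(n_steps):
        h = sigmoid(np.outer(x, a) + b)                 # hidden activations, shape (n, width)
        err = h @ c + c0 - y                            # residuals
        grad_c = h.T @ err / n
        grad_c0 = err.mean()
        dh = (err[:, None] * c) * h * (1 - h)           # backprop through the sigmoid
        grad_a = (dh * x[:, None]).sum(axis=0) / n
        grad_b = dh.sum(axis=0) / n
        a -= lr * grad_a; b -= lr * grad_b; c -= lr * grad_c; c0 -= lr * grad_c0
    return lambda xn: sigmoid(np.outer(xn, a) + b) @ c + c0

# toy usage: approximate a bumpy function on [-3, 3]
x = np.linspace(-3, 3, 400)
y = np.sin(2 * x) + 0.3 * np.cos(5 * x)
f = fit_one_hidden_layer(x, y, width=64, lr=0.1)
print("train MSE:", np.mean((f(x) - y) ** 2))
```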
- L28 (Apr 22): Quiz 3
- Topics: kernels (basics, regression, classification), unsupervised learning (clustering, PCA, kernel PCA, stochastic PCA, deep PCA, autoencoders)
- L29 (Apr 27): Unsupervised learning: ICA, CCA, SDR
- L30 (Apr 29): Calibration and end-class recap
Some textbooks:
References to catch up on prerequisites:
Potentially useful resources available online (not necessarily verified by the course staff):