36-708: The ABCDE of Statistical Methods in Machine Learning
(aka: developing judgment for complex stat-ml methods)
Class location and time: DH 1211 (MW 10:30-11:50am)
Office hours locations and times: Aaditya: BH 132H (MW 12-12:30pm), Pratik: GHC 8106 (T 2-3pm)
Course details:
Scribes:
Homeworks:
Lectures:
- L01 (Jan 13): Introduction
- L02 (Jan 15): Basics of supervised learning: regression, classification
- L-- (Jan 20): No class (MLK day)
- L03 (Jan 22): Nearest-neighbor methods: k-nn regression and classification
- Scribe note (Yunhan Wen)
- Lectures on the nearest neighbor method (Biau, Devroye, 2015)
- Prototype methods and nearest-neighbors (Hastie, Tibshirani, Friedman, 2017) [Elements of statistical learning, chapter 13]
- Nearest neighbor pattern classification (Cover, Hart, 1967)
- Kernel and nearest-neighbor estimation of a conditional quantile (Bhattacharya, Gangopadhyay, 1990)
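As a rough companion to L03, a minimal numpy sketch of k-nearest-neighbor regression with plain Euclidean distances and unweighted averaging; all function and variable names are illustrative, not taken from the lecture or scribe notes.

```python
# Minimal k-NN regression sketch (companion to L03); names are illustrative.
import numpy as np

def knn_regress(X_train, y_train, X_query, k=5):
    """Predict at each query point by averaging the labels of its k nearest training points."""
    preds = np.empty(len(X_query))
    for i, x in enumerate(X_query):
        dists = np.linalg.norm(X_train - x, axis=1)   # Euclidean distances to training points
        nn_idx = np.argsort(dists)[:k]                # indices of the k nearest neighbors
        preds[i] = y_train[nn_idx].mean()             # local average
    return preds

# toy usage: noisy sine curve
rng = np.random.default_rng(0)
X = rng.uniform(0, 2 * np.pi, size=(200, 1))
y = np.sin(X).ravel() + 0.1 * rng.standard_normal(200)
X_grid = np.linspace(0, 2 * np.pi, 50).reshape(-1, 1)
print(knn_regress(X, y, X_grid, k=10)[:5])
```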
- L04 (Jan 27): Predictive inference: conformal prediction
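A minimal sketch of split conformal prediction intervals for regression, in the spirit of L04; the least-squares base learner, the `fit_predict` interface, and the miscoverage level alpha=0.1 are illustrative assumptions, not the course's code.

```python
# Split conformal prediction sketch (companion to L04); names are illustrative.
import numpy as np

def split_conformal_interval(X, y, fit_predict, X_test, alpha=0.1, rng=None):
    """Fit on one half of the data, calibrate absolute residuals on the other half,
    and return a (1 - alpha) prediction interval at each test point."""
    rng = rng or np.random.default_rng(0)
    n = len(y)
    idx = rng.permutation(n)
    train, calib = idx[: n // 2], idx[n // 2 :]
    predict = fit_predict(X[train], y[train])                 # returns a prediction function
    scores = np.abs(y[calib] - predict(X[calib]))             # conformity scores
    # quantile level with the finite-sample conformal correction
    level = min(1.0, np.ceil((1 - alpha) * (len(calib) + 1)) / len(calib))
    q = np.quantile(scores, level)
    mu = predict(X_test)
    return mu - q, mu + q

# toy usage with a least-squares line as the base learner
def fit_ls(X, y):
    beta, *_ = np.linalg.lstsq(np.c_[np.ones(len(X)), X], y, rcond=None)
    return lambda Xn: np.c_[np.ones(len(Xn)), Xn] @ beta

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 1)); y = 2 * X.ravel() + rng.normal(size=300)
lo, hi = split_conformal_interval(X, y, fit_ls, X[:5], alpha=0.1)
print(np.c_[lo, hi])
```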
- L05 (Jan 29): Ensemble methods: boosting (game-theoretic perspective)
- Scribe note (Sasha Podkopaev)
- Boosting (Mohri, Rostamizadeh, Talwalkar, 2018) [Foundations of machine learning, chapter 07]
- The strength of weak learnability (Schapire, 1990)
- Boosting a weak learning algorithm by majority (Freund, 1995)
- The weighted majority algorithm (Littlestone, Warmuth, 1992)
- A decision-theoretic generalization of on-line learning and an application to boosting (Freund, Schapire, 1997)
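A bare-bones AdaBoost-with-decision-stumps sketch as a companion to the boosting lectures (L05-L08), under the y in {-1, +1} label convention; the exhaustive stump search and all names are illustrative, not the course's implementation.

```python
# AdaBoost with decision stumps, minimal sketch (companion to L05-L08).
import numpy as np

def fit_stump(X, y, w):
    """Exhaustively pick the weighted-error-minimizing threshold stump."""
    best = (np.inf, 0, 0.0, 1)                       # (error, feature, threshold, sign)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            for s in (1, -1):
                pred = s * np.sign(X[:, j] - t + 1e-12)
                err = w[pred != y].sum()
                if err < best[0]:
                    best = (err, j, t, s)
    return best

def adaboost(X, y, n_rounds=20):
    n = len(y)
    w = np.full(n, 1.0 / n)                          # uniform initial weights
    stumps, alphas = [], []
    for _ in range(n_rounds):
        err, j, t, s = fit_stump(X, y, w)
        err = np.clip(err, 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)        # weak learner's vote weight
        pred = s * np.sign(X[:, j] - t + 1e-12)
        w *= np.exp(-alpha * y * pred)               # upweight the mistakes
        w /= w.sum()
        stumps.append((j, t, s)); alphas.append(alpha)
    def predict(Xn):
        agg = sum(a * s * np.sign(Xn[:, j] - t + 1e-12)
                  for a, (j, t, s) in zip(alphas, stumps))
        return np.sign(agg)
    return predict

# toy usage: two Gaussian blobs
rng = np.random.default_rng(2)
X = np.r_[rng.normal(-1, 1, (100, 2)), rng.normal(1, 1, (100, 2))]
y = np.r_[-np.ones(100), np.ones(100)]
clf = adaboost(X, y, n_rounds=30)
print("training accuracy:", (clf(X) == y).mean())
```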
- L06 (Feb 03): Ensemble methods: boosting (statistical perspective)
- L07 (Feb 05): Ensemble methods: boosting (computational considerations, applications), guest lecture by Allie
- L08 (Feb 10): Ensemble methods: boosting (generalization)
- L09 (Feb 12): Quiz 1
- Topics: basics (supervised learning), prototype methods (nearest-neighbor methods), predictive inference (conformal prediction), ensemble methods (boosting)
- L10 (Feb 17): Ensemble methods: bagging, random forests
- Scribe note (Andrew Warren)
- Random forests (Hastie, Tibshirani, Friedman, 2017) [Elements of statistical learning, chapter 15]
- Tree-based methods (James, Witten, Hastie, Tibshirani, 2017) [An introduction to statistical learning, chapter 08]
- Bagging predictors (Breiman, 1996)
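A minimal sketch of bagging (bootstrap aggregation) as a companion to L10; a quadratic least-squares fit stands in for the trees used in random forests, and all names and defaults are illustrative assumptions.

```python
# Bagging by bootstrap averaging, minimal sketch (companion to L10).
import numpy as np

def bagged_predict(X, y, X_test, fit_predict, n_boot=50, rng=None):
    """Fit the base learner on n_boot bootstrap resamples and average their predictions."""
    rng = rng or np.random.default_rng(0)
    n = len(y)
    preds = np.zeros((n_boot, len(X_test)))
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)             # bootstrap sample (with replacement)
        preds[b] = fit_predict(X[idx], y[idx])(X_test)
    return preds.mean(axis=0)                        # aggregated (averaged) prediction

# toy usage with a quadratic least-squares base learner standing in for a tree
def fit_quad(X, y):
    Z = np.c_[np.ones(len(X)), X, X ** 2]
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return lambda Xn: np.c_[np.ones(len(Xn)), Xn, Xn ** 2] @ beta

rng = np.random.default_rng(3)
X = rng.uniform(-2, 2, size=(200, 1))
y = X.ravel() ** 2 + 0.3 * rng.standard_normal(200)
print(bagged_predict(X, y, X[:5], fit_quad, n_boot=100))
```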
- L11 (Feb 19): Variable importance: random forests case study
- L12 (Feb 24): Datapoint importance: Shapley values
- Scribe note (Zeyu Tang)
- Notes on the n-person game -- II: The value of an n-person game (Shapley, 1951)
- A unified approach to interpreting model predictions (Lundberg, Lee, 2017)
- Data Shapley: Equitable valuation of data for machine learning (Ghorbani, Zou, 2019)
- Shapley values (Molnar, 2020) [Interpretable machine learning, chapter 05]
- Problems with Shapley-value-based explanations as feature importance measures (Kumar, Venkatasubramanian, Scheidegger, Friedler, 2020)
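A Monte Carlo sketch of Shapley values for datapoint importance as a companion to L12, loosely in the spirit of Data Shapley (Ghorbani, Zou, 2019); the 1-nearest-neighbor validation-accuracy utility is an illustrative stand-in for whatever value function the lecture used.

```python
# Monte Carlo Shapley values for datapoint importance, minimal sketch (companion to L12).
import numpy as np

def utility(train_idx, X, y, X_val, y_val):
    """Validation accuracy of a 1-nearest-neighbor classifier fit on a subset of the data."""
    if len(train_idx) == 0:
        return 0.0
    d = np.linalg.norm(X_val[:, None, :] - X[train_idx][None, :, :], axis=2)
    preds = y[train_idx][d.argmin(axis=1)]
    return (preds == y_val).mean()

def mc_shapley(X, y, X_val, y_val, n_perms=50, rng=None):
    """Average each point's marginal contribution to the utility over random permutations."""
    rng = rng or np.random.default_rng(0)
    n = len(y)
    phi = np.zeros(n)
    for _ in range(n_perms):
        perm = rng.permutation(n)
        prev = 0.0
        for pos in range(n):
            cur = utility(perm[: pos + 1], X, y, X_val, y_val)
            phi[perm[pos]] += cur - prev                 # marginal contribution of this point
            prev = cur
    return phi / n_perms

# toy usage: two labeled Gaussian blobs
rng = np.random.default_rng(4)
X = np.r_[rng.normal(-1, 1, (30, 2)), rng.normal(1, 1, (30, 2))]
y = np.r_[np.zeros(30), np.ones(30)]
Xv = np.r_[rng.normal(-1, 1, (20, 2)), rng.normal(1, 1, (20, 2))]
yv = np.r_[np.zeros(20), np.ones(20)]
print(mc_shapley(X, y, Xv, yv, n_perms=20)[:5])
```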
- L13 (Feb 26): Ensemble methods: stacking
- L14 (Mar 02): Predictive inference: jackknife+
- L15 (Mar 04): Predictive inference: leave-one-out
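A minimal sketch of the jackknife+ prediction interval as a companion to L14-L15, built from leave-one-out residuals and leave-one-out predictions; the least-squares base learner and the helper names are illustrative assumptions.

```python
# Jackknife+ prediction interval, minimal sketch (companion to L14-L15).
import numpy as np

def fit_ls(X, y):
    beta, *_ = np.linalg.lstsq(np.c_[np.ones(len(X)), X], y, rcond=None)
    return lambda Xn: np.c_[np.ones(len(Xn)), Xn] @ beta

def jackknife_plus(X, y, x_test, alpha=0.1, fit=fit_ls):
    """Combine leave-one-out residuals R_i and leave-one-out predictions at x_test
    into the jackknife+ interval via order statistics."""
    n = len(y)
    lo_terms, hi_terms = np.empty(n), np.empty(n)
    for i in range(n):
        mask = np.arange(n) != i
        mu_i = fit(X[mask], y[mask])                       # model trained without point i
        r_i = abs(y[i] - mu_i(X[i : i + 1])[0])            # leave-one-out residual
        m = mu_i(x_test.reshape(1, -1))[0]
        lo_terms[i], hi_terms[i] = m - r_i, m + r_i
    k_lo = int(np.floor(alpha * (n + 1)))                  # lower order statistic
    k_hi = int(np.ceil((1 - alpha) * (n + 1)))             # upper order statistic
    return np.sort(lo_terms)[max(k_lo - 1, 0)], np.sort(hi_terms)[min(k_hi - 1, n - 1)]

# toy usage
rng = np.random.default_rng(5)
X = rng.normal(size=(100, 1)); y = 3 * X.ravel() + rng.normal(size=100)
print(jackknife_plus(X, y, np.array([0.5]), alpha=0.1))
```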
- L-- (Mar 09): No class (spring break)
- L-- (Mar 11): No class (spring break)
- L16 (Mar 16): No class (preparation for the COVID-19 move online)
- L17 (Mar 18): Mid-class recap: (methods/rows) k-nearest neighbors, boosting, bagging, random forests, stacking; (aspects/columns) algorithms, bias-variance, computation, conformal; (practical) data aspects, explainability, interpretability
- L18 (Mar 23): Quiz 2
- Topics: variable importance (random forest), data importance (Shapley values), ensemble methods (stacking), predictive inference (jackknife+, leave-one-out)
- L19 (Mar 25): Kernel learning: basics (RKHS intro)
- L20 (Mar 30): Kernel learning: basics (RKHS equivalences)
- L21 (Apr 01): Kernel learning: basics (universal/characteristic kernel)
- L22 (Apr 06): Kernel learning: kernel regression and classification (kernel ridge regression, kernel SVM, kernel logistic regression)
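A minimal kernel ridge regression sketch with an RBF kernel as a companion to L22; the bandwidth, the lambda*n regularization scaling, and all names are illustrative assumptions.

```python
# Kernel ridge regression with an RBF kernel, minimal sketch (companion to L22).
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    d2 = (np.sum(A ** 2, axis=1)[:, None] + np.sum(B ** 2, axis=1)[None, :]
          - 2 * A @ B.T)                                  # pairwise squared distances
    return np.exp(-gamma * d2)

def kernel_ridge_fit(X, y, gamma=1.0, lam=1e-2):
    """Solve (K + lam * n * I) alpha = y; predict with f(x) = sum_i alpha_i k(x, x_i)."""
    n = len(y)
    K = rbf_kernel(X, X, gamma)
    alpha = np.linalg.solve(K + lam * n * np.eye(n), y)
    return lambda Xn: rbf_kernel(Xn, X, gamma) @ alpha

# toy usage: fit a noisy sine curve
rng = np.random.default_rng(9)
X = rng.uniform(0, 2 * np.pi, size=(150, 1))
y = np.sin(X).ravel() + 0.1 * rng.standard_normal(150)
f = kernel_ridge_fit(X, y, gamma=2.0, lam=1e-3)
print(f(np.array([[0.0], [np.pi / 2], [np.pi]])))   # should be roughly 0, 1, 0
```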
- L23 (Apr 08): Unsupervised learning: clustering (kernel hierarchical clustering, k-means clustering)
- L24 (Apr 13): Unsupervised learning: dimension reduction (PCA, kernel PCA)
- Scribe note (Misha Khodak)
- Unsupervised learning (Hastie, Tibshirani, Friedman, 2017) [Elements of statistical learning, chapter 14]
- Resampling methods (James, Witten, Hastie, Tibshirani, 2017) [An introduction to statistical learning, chapter 05]
- Kernel tricks and nonlinear dimensionality reduction via RBF kernel PCA (Raschka, 2014)
- Kernel Principal Component Analysis and its Applications in Face Recognition and Active Shape Models (Wang, 2012)
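A minimal RBF kernel PCA sketch as a companion to L24: double-center the Gram matrix, eigendecompose, and read off the projection scores. The bandwidth and all names are illustrative assumptions.

```python
# RBF kernel PCA, minimal sketch (companion to L24).
import numpy as np

def rbf_kernel(X, gamma=1.0):
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T          # pairwise squared distances
    return np.exp(-gamma * d2)

def kernel_pca(X, n_components=2, gamma=1.0):
    n = len(X)
    K = rbf_kernel(X, gamma)
    J = np.eye(n) - np.ones((n, n)) / n
    Kc = J @ K @ J                                        # double-centered Gram matrix
    vals, vecs = np.linalg.eigh(Kc)                       # eigenvalues in ascending order
    vals, vecs = vals[::-1], vecs[:, ::-1]                # sort descending
    # projection scores: Kc v_k / sqrt(lambda_k) = sqrt(lambda_k) v_k
    return Kc @ vecs[:, :n_components] / np.sqrt(np.maximum(vals[:n_components], 1e-12))

# toy usage: two concentric circles, which linear PCA cannot separate
rng = np.random.default_rng(6)
theta = rng.uniform(0, 2 * np.pi, 200)
r = np.r_[np.ones(100), 3 * np.ones(100)] + 0.1 * rng.standard_normal(200)
X = np.c_[r * np.cos(theta), r * np.sin(theta)]
print(kernel_pca(X, n_components=2, gamma=2.0)[:3])
```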
- L25 (Apr 15): Unsupervised learning: dimension reduction (stochastic PCA, deep PCA, autoencoders)
- Scribe note (not available, sorry!)
- Unsupervised learning (Hastie, Tibshirani, Friedman, 2017) [Elements of statistical learning, chapter 14]
- Unsupervised learning (James, Witten, Hastie, Tibshirani, 2017) [An introduction to statistical learning, chapter 10]
- The Fast Convergence of Incremental PCA (Balsubramani, Dasgupta, Freund, 2015)
- Extracting and Composing Robust Features with Denoising Autoencoders (Vincent, Larochelle, Bengio, Manzagol, 2008)
- Autoencoders, Unsupervised Learning, and Deep architectures (Baldi, 2012)
- Stacked denoising autoencoders: learning useful representations in a deep network with a local denoising criterion (Vincent, Larochelle, Lajoie, Manzagol, 2010)
- A PCA-like autoencoder (Ladjal, Newson, Pham, 2019)
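A minimal sketch of stochastic (incremental) PCA via an Oja-style update as a companion to L25, estimating the top principal direction one observation at a time; the step-size schedule and names are illustrative assumptions, and the batch eigenvector is computed only for comparison.

```python
# Stochastic (incremental) PCA via Oja's rule, minimal sketch (companion to L25).
import numpy as np

def oja_top_component(X, lr0=0.1, n_epochs=5, rng=None):
    """Streaming update w <- normalize(w + eta_t * x (x . w)) on centered data."""
    rng = rng or np.random.default_rng(0)
    Xc = X - X.mean(axis=0)                               # center once for simplicity
    w = rng.standard_normal(X.shape[1])
    w /= np.linalg.norm(w)
    t = 0
    for _ in range(n_epochs):
        for i in rng.permutation(len(Xc)):
            t += 1
            eta = lr0 / np.sqrt(t)                        # decaying step size
            w += eta * Xc[i] * (Xc[i] @ w)                # Oja / Hebbian-style update
            w /= np.linalg.norm(w)
    return w

# toy usage: compare against the batch top eigenvector
rng = np.random.default_rng(7)
X = rng.multivariate_normal([0, 0], [[3.0, 1.0], [1.0, 1.0]], size=2000)
w_stream = oja_top_component(X)
w_batch = np.linalg.eigh(np.cov(X.T))[1][:, -1]
print("alignment |cos angle|:", abs(w_stream @ w_batch))
```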
- L26 (Apr 17): Guest lecture by Lucas Mentch
- L27 (Apr 20): Neural networks
- Scribe note (Jenn Williams)
- Neural networks (Hastie, Tibshirani, Friedman, 2017) [Elements of statistical learning, chapter 11]
- Breaking the curse of dimensionality with convex neural networks (Bach, 2017)
- Universal approximation bounds for superpositions of a sigmoidal function (Barron, 1993)
- Approximation by superpositions of a sigmoidal function (Cybenko, 1989)
- Approximations of continuous functionals by neural networks with application to dynamic systems (Chen, Chen, 1993)
- Universal approximation to nonlinear operators by neural networks with arbitrary activation functions and its application to dynamical systems (Chen, Chen, 1995)
- Gradient descent only converges to minimizers (Lee, Simchowitz, Jordan, Recht, 2016)
- First-order methods almost always avoid saddle points: the case of vanishing step-sizes (Panageas, Piliouras, Wang, 2019)
- Benefits of depth in neural networks (Telgarsky, 2016)
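A minimal sketch of fitting a one-hidden-layer sigmoid network by full-batch gradient descent, as a companion to L27's superposition-of-sigmoids (Cybenko/Barron) theme; the architecture, step sizes, and names are illustrative assumptions, not the lecture's code.

```python
# One-hidden-layer sigmoid network fit by gradient descent, minimal sketch (companion to L27).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -60.0, 60.0)))   # clipped for numerical safety

def fit_one_hidden_layer(x, y, width=32, lr=0.05, n_steps=3000, rng=None):
    """Least-squares fit of f(x) = sum_k c_k * sigmoid(a_k x + b_k) + c0 by full-batch GD."""
    rng = rng or np.random.default_rng(0)
    a = rng.standard_normal(width); b = rng.standard_normal(width)
    c = 0.1 * rng.standard_normal(width); c0 = 0.0
    n = len(x)
    for _ in range(n_steps):
        h = sigmoid(np.outer(x, a) + b)                 # hidden activations, shape (n, width)
        err = h @ c + c0 - y                            # residuals
        grad_c = h.T @ err / n
        grad_c0 = err.mean()
        dh = (err[:, None] * c) * h * (1 - h)           # backprop through the sigmoid
        grad_a = (dh * x[:, None]).sum(axis=0) / n
        grad_b = dh.sum(axis=0) / n
        a -= lr * grad_a; b -= lr * grad_b; c -= lr * grad_c; c0 -= lr * grad_c0
    return lambda xn: sigmoid(np.outer(xn, a) + b) @ c + c0

# toy usage: approximate a bumpy function on [-3, 3]
x = np.linspace(-3, 3, 400)
y = np.sin(2 * x) + 0.3 * np.cos(5 * x)
f = fit_one_hidden_layer(x, y, width=64, lr=0.1)
print("train MSE:", np.mean((f(x) - y) ** 2))
```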
- L28 (Apr 22): Quiz 3
- Topics: kernels (basics, regression, classification), unsupervised learning (clustering, PCA, kernel PCA, stochastic PCA, deep PCA, autoencoders)
- L29 (Apr 27): Unsupervised learning: ICA, CCA, SDR
- L30 (Apr 29): Calibration and end-class recap
Some textbooks:
References to catch up on prerequisites:
Potentially useful resources available online (not necessarily verified by the course staff):