MVE440/MSA220, Statistical Learning for Big Data, Spring 19

Latest news

Welcome to MVE440/MSA220 Statistical Learning for Big Data!

Teachers

Course coordinator: Felix Held (felix.held@chalmers.se)

Examiner: Rebecka Jörnsten (jornsten@chalmers.se)

Teaching assistant: Juan Inda Diaz (inda@chalmers.se)

Project presentations will be handled by all three of us.

Current office hours are posted in the Latest news section above.

Course literature

Most of the course content can be found in

Hastie, T, Tibshirani, R, and Friedman, J (2009) The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd ed. New York: Springer Science+Business Media, LLC

This book is freely available online. Whenever I refer to ESL, I mean this book.

Other helpful books

Lectures

This is a preliminary lecture plan. It will be updated after each lecture to reflect what was actually covered. In addition, I will link to lecture slides, blackboard notes (at least the ones that were planned), and the code to recreate the plots and numerical results presented in each lecture.

Week | Day | Recommended Reading | Contents | Files
13 | Monday | ESL Ch. 1 and 2 | Course Overview; Introduction to Big Data and Statistical Learning | Lecture 1: Slides (post-lecture), Notes, Code
13 | Thursday | ESL Sec. 4.1–4.3 (before 4.3.1), Sec. 4.4 Intro + 4.4.2 | Classification: 0-1 regression; logistic, probit and softmax regression; nearest centroids; linear, quadratic and diagonal discriminant analysis | Lecture 2: Slides (post-lecture), Notes, Code
13 | Friday | ESL Sec. 7.1–7.5, 7.10 | Model Assessment for Predictive Learning; Model Selection through Cross-Validation (see the code sketch after this table) | Lecture 3: Slides (post-lecture), Notes, Code
14 | Monday | ESL Sec. 9.2, 8.7, 15.1–15.3 | Classification and Regression Trees (CART); Random Forests | Lecture 4: Slides (post-lecture), Code
14 | Thursday | ESL Sec. 15.3.1–15.3.2, 4.3.1 | Examples of Random Forests and Variable Importance; Singular Value Decomposition; Principal Component Analysis; Regularized Discriminant Analysis | Lecture 5: Slides (post-lecture), Code, Additional theory (optional)
14 | Friday | ESL Sec. 4.3.3, 14.3–14.3.6; blog post on caveats of k-means | Fisher's LDA; Introduction to Clustering; Combinatorial Clustering; k-means | Lecture 6: Slides (post-lecture), Code
15 | Monday | ESL Sec. 14.3.7–14.3.8, 14.3.10–14.3.12 | k-medoids/partitioning around medoids; Selection of Cluster Count; Hierarchical Clustering; Gaussian Mixture Models | Lecture 7: Slides (post-lecture), Code
15 | Thursday | ESL Sec. 8.5, 12.7 | Expectation Maximization and Clustering; Mixture Discriminant Analysis; Density-based Clustering/DBSCAN | Lecture 8: Slides (post-lecture), Code
15 | Friday | – | Project Presentations 1 | –
16 | Monday | ESL Sec. 3.4, 3.8.4–3.8.5 | Penalized regression methods: Regularization and Variable Selection (Ridge Regression, Lasso) | Lecture 9: Slides (post-lecture), Code
18 | Thursday | ESL Sec. 3.8.4, 3.8.6; Ch. 18 up to and including Sec. 18.3.1; Sec. 18.4 | Penalized classification: Nearest Shrunken Centroids; Computational Aspects of the Lasso; Elastic Net; Group Lasso | Lecture 10: Slides (post-lecture), Code
18 | Friday | ESL Sec. 3.8.5–3.8.6; 14.5.1 (up to and including the handwritten digits example); 14.6; 14.7.1 | Penalized regression: Oracle Estimators; Adaptive Lasso; SCAD; Sparse Logistic Regression. Data representations: SVD Revisited; Factor Analysis; Non-negative Matrix Factorization | Lecture 11: Slides (post-lecture), Code
19 | Monday | – | Project Presentations 2 | –
19 | Thursday | ESL Sec. 14.5.4, 14.6 | Data representations: Non-negative Matrix Factorization (cont'd); Intro to Kernels and Kernel Methods; Kernel PCA | Lecture 12: Slides (post-lecture), Code
19 | Friday | ESL Sec. 14.8, 14.9 (Intro + Isomap); interactive demonstration and caveats of t-SNE | Data representations: Multidimensional Scaling; Isomap; t-SNE | Lecture 13: Slides (post-lecture), Code
20 | Monday | – | Project Presentations 3 | –
20 | Thursday | – | Cancelled | –
20 | Friday | ESL Sec. 14.5.3, 17.3 | High-dimensional clustering: Subspace Clustering; Spectral Clustering; Graphical Lasso | Lecture 14: Slides (post-lecture), Code
21 | Monday | – | Project Presentations 4 | –
21 | Thursday | – | Large sample methods: Randomized Projection; Randomized SVD; Divide and Conquer; Random Forests for big n; m-out-of-n Bootstrap; Bag of Little Bootstraps; Leveraging | Lecture 15: Slides (post-lecture), Code
21 | Friday | – | Review | Lecture 16: Slides (post-lecture)
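
As a small taste of the material, here is a minimal sketch of k-fold cross-validation for model assessment (Lecture 3). This is not course code: the choice of Python with scikit-learn, the simulated data set and the logistic regression model are all assumptions made purely for illustration.

# Minimal k-fold cross-validation sketch (not course material; all
# modelling choices below are illustrative assumptions).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# Simulated binary classification data
X, y = make_classification(n_samples=200, n_features=10, random_state=1)

# Split the data into 5 folds; each fold is held out once for testing
kf = KFold(n_splits=5, shuffle=True, random_state=1)
accuracies = []
for train_idx, test_idx in kf.split(X):
    # Fit on the training folds, evaluate on the held-out fold
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])
    accuracies.append(model.score(X[test_idx], y[test_idx]))

print(f"5-fold CV accuracy: {np.mean(accuracies):.3f} +/- {np.std(accuracies):.3f}")

The average over held-out folds estimates predictive performance on new data; comparing this estimate across candidate models is the model-selection use of cross-validation discussed in Lecture 3.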

Projects

A major part of this course is the work on five small projects. Doing and presenting the projects is mandatory, since an individual revision of each project makes up 50% of the take-home exam and therefore of your grade.

The course will provide you with an overview of available methods and some examples of how they can be used. The projects, on the other hand, are meant to give you deeper insight into algorithmic assumptions and hands-on experience with data analysis.

General information

Course requirements

The official course-specific prerequisites, as stated in the syllabus, are:

The prerequisites for the course are a basic course in statistical inference and MVE190/MSG500 Linear Statistical Models. Students can also contact the course instructor for permission to take the course.

This means you should be familiar with the following:

Examination

Your final grade will be based on an individual take-home exam. Question 1 of the final will be an individual write-up of the five projects, where you can revise and improve the work you did during the course. The projects count for 50% of your final grade, and attendance on presentation days is therefore compulsory. The other questions on the final will be a set of data analysis tasks, one of which is a "mini-project" on a data set of your own choice.

Examination procedures

In the Chalmers Student Portal you can read about when exams are given and what rules apply to exams at Chalmers. In addition, there is a schedule of when exams are given for courses at the University of Gothenburg.

Since this course is examined through a take-home exam, you will not have to register before taking it. If you have attended all presentation days, you are automatically admitted to the exam. You will be informed about your admission towards the end of the course.

After the exam has been graded, you can see your results in Ladok by logging on to your Student portal.

At the annual (regular) examination:
When practical, a separate review is arranged. The date of the review will be announced here on the course homepage. Anyone who cannot participate in the review may thereafter retrieve and review their exam at the Mathematical Sciences Student office. Check that you have the right grades and score. Any complaints about the marking must be submitted in writing at the office, where there is a form to fill out.

At re-examination:
Exams are reviewed and retrieved at the Mathematical Sciences Student office. Check that you have the right grades and score. Any complaints about the marking must be submitted in writing at the office, where there is a form to fill out.