Latest news
Welcome to MVE440/MSA220 Statistical Learning for Big Data!

Upcoming office hours:
 Monday, 3rd June, 11.45–12.45 with Rebecka in L3056
 Tuesday, 4th June, 9.00–10.00 with Juan in L3098
 Wednesday, 5th June, 13.30–14.30 with Felix in L3098
 Note: There will not be a project 5 this year.
 Interactive website demonstrating some of the caveats of t-SNE
 Office hours for week 20 (13th May – 19th May): Tuesday 11–12, Wednesday 13.30–14.30, and Thursday 14.30–15.30 (changed) in the Mathematical Sciences building, room L3098
 Office hours for week 19 (6th May – 12th May): Tuesday 10.00–12.00 and Wednesday 13.30–15.00 in the Mathematical Sciences building, room L3098. Due to other obligations I cannot offer office hours on Thursday or Friday, but feel free to ask during lecture breaks, after lectures, or by email.
 Office hours for week 18 (29th April – 5th May): Monday 13.30–15 and Thursday 13.30–15 in the Mathematical Sciences building, room L3098
 Lecture slides for lecture 9 (15th April) are finally online with some additional (and improved) figures to hopefully increase understanding of penalization methods.
 Some notes on the linear algebra behind PCA and SVD can be found here.
 Office hours for week 15 (8th–12th April): Monday 13.30–15 and Wednesday 13.30–15 in the Mathematical Sciences building, room L3098

The student representatives for this course are
 MPCAS Joakim Jansson JOAJANS
 MPCAS Lars Jansson LARJANS
 MPSYS Sondre Chanon Wiersdalen CHANON
 MPENM Klas Holmgren KLASHO
 The course starts on Monday, 25th March 2019, 15.15, in lecture hall HA3, Hörsalsvägen.
 The schedule for the course can be found in TimeEdit.
 For basic information on the course and the learning goals check out the syllabus for GU (in Swedish) and Chalmers (in English).
Teachers
Course coordinator: Felix Held (felix.held@chalmers.se)
Examinator: Rebecka Jörnsten (jornsten@chalmers.se)
Teaching assistant: Juan Inda Diaz (inda@chalmers.se)
Project presentations will be handled by all three of us.
Current office hours are listed in the News section.
Course literature
Most of the course content can be found in
Hastie, T, Tibshirani, R, and Friedman, J (2009) The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd ed. New York: Springer Science+Business Media, LLC
This book is freely available online. If I refer to ESL, then I mean this book.
Other helpful books

Books that have a more practical angle than ESL:

James, G, Witten, D, Hastie, T, and Tibshirani, R (2013) An Introduction to Statistical Learning: With Applications in R. New York: Springer Science+Business Media, LLC
The little sibling of ESL, freely available online.
 Kuhn, M and Johnson, K (2013) Applied Predictive Modeling. New York: Springer Science+Business Media, LLC

For a theoretical angle on the more traditional (small to medium data) parts of the course:
 Falk, M, Marohn, F, and Tewes, B (2002) Foundations of Statistical Analyses and Applications with SAS. Basel: Birkhäuser

A very pedagogical book with a (mostly) Bayesian angle:
 Bishop, CM (2006) Pattern Recognition and Machine Learning. New York: Springer

A compendium with more focus on the Bayesian angle; tougher to learn from, but great as a reference:
 Murphy, KP (2012) Machine Learning: A Probabilistic Perspective. Cambridge, Massachusetts: The MIT Press
Lectures
This is a preliminary lecture plan. It will be updated after each lecture to reflect reality. In addition, I'll link to the lecture slides, blackboard notes (at least those that were planned), and the code to recreate the plots or numerical results presented in each lecture.
Week | Day | Recommended Reading | Contents | Files

13 | Monday | ESL Ch. 1 and 2 | Course Overview; Introduction to Big Data and Statistical Learning | Lecture 1: Slides (post-lecture), Notes, Code
13 | Thursday | ESL Ch. 4 (Sec. 4.1–4.3 before 4.3.1; Sec. 4.4 Intro + 4.4.2) | Classification: 0-1 regression; logistic, probit and softmax regression; nearest centroids; linear, quadratic and diagonal discriminant analysis | Lecture 2: Slides (post-lecture), Notes, Code
13 | Friday | ESL Ch. 7 (Sec. 7.1–7.5, 7.10) | Model Assessment for Predictive Learning; Model Selection through Cross-Validation | Lecture 3: Slides (post-lecture), Notes, Code
14 | Monday | ESL Sec. 9.2, 8.7, 15.1–15.3 | Classification and Regression Trees (CART); Random Forests | Lecture 4: Slides (post-lecture), Code
14 | Thursday | ESL Sec. 15.3.1–15.3.2; 4.3.1 | Examples of Random Forests & Variable Importance; Singular Value Decomposition; Principal Component Analysis; Regularized Discriminant Analysis | Lecture 5: Slides (post-lecture), Code, Additional theory (optional)
14 | Friday | ESL Sec. 4.3.3; 14.3–14.3.6; blog post on caveats of k-means | Fisher's LDA; Introduction to Clustering; Combinatorial Clustering; k-means | Lecture 6: Slides (post-lecture), Code
15 | Monday | ESL Sec. 14.3.7–8, 14.3.10–12 | k-medoids/partitioning around medoids; Selection of Cluster Count; Hierarchical Clustering; Gaussian Mixture Models | Lecture 7: Slides (post-lecture), Code
15 | Thursday | ESL Sec. 8.5, 12.7 | Expectation Maximization and Clustering; Mixture Discriminant Analysis; Density-based clustering/DBSCAN | Lecture 8: Slides (post-lecture), Code
15 | Friday | – | Project Presentations 1 | –
16 | Monday | ESL Ch. 3.4; Sec. 3.8.4–5 | Penalized regression methods: Regularization and Variable selection (Ridge Regression, Lasso) | Lecture 9: Slides (post-lecture), Code
18 | Thursday | ESL Sec. 3.8.4, 3.8.6; Ch. 18 up to and including 18.3.1; Sec. 18.4 | Penalized classification: Nearest Shrunken Centroids; Computational aspects of the lasso; Elastic Net; Group Lasso | Lecture 10: Slides (post-lecture), Code
18 | Friday | ESL Sec. 3.8.5–3.8.6; 14.5.1 (up to and including the handwritten digits example); 14.6; 14.7.1 | Penalized regression: Oracle estimators; adaptive lasso; SCAD; sparse logistic regression. Data representations: SVD Revisited; Factor Analysis; Nonnegative Matrix Factorization | Lecture 11: Slides (post-lecture), Code
19 | Monday | – | Project Presentations 2 | –
19 | Thursday | ESL Sec. 14.5.4; 14.6 | Data representations: Nonnegative Matrix Factorization (cont'd); Intro to kernels and kernel methods; kernel PCA | Lecture 12: Slides (post-lecture), Code
19 | Friday | ESL Sec. 14.8; 14.9 (Intro + Isomap); interactive demonstration and caveats of t-SNE | Data representations: Multidimensional Scaling; Isomap; t-SNE | Lecture 13: Slides (post-lecture), Code
20 | Monday | – | Project Presentations 3 | –
20 | Thursday | – | Cancelled | –
20 | Friday | ESL Sec. 14.5.3; 17.3 | High-dimensional clustering: Subspace clustering; Spectral clustering; Graphical Lasso | Lecture 14: Slides (post-lecture), Code
21 | Monday | – | Project Presentations 4 | –
21 | Thursday | – | Large-sample methods: Randomized Projection; Randomized SVD; Divide and Conquer; Random Forests for big n; m-out-of-n bootstrap; bag of little bootstraps; leveraging | Lecture 15: Slides (post-lecture), Code
21 | Friday | – | Review | Lecture 16: Slides (post-lecture)
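As a small taste of the model-selection material in the lecture plan (cross-validation, Lecture 3), here is a minimal sketch of k-fold cross-validation used to choose a polynomial degree. This is illustrative only, not course material; the data, degrees, and function names are made up for the example, and it uses plain NumPy rather than the R tools used in the course.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
x = rng.uniform(-1.0, 1.0, size=n)
# True model is quadratic with a little noise
y = 1.0 + 2.0 * x - 1.5 * x**2 + 0.1 * rng.normal(size=n)

# One fixed random split into 5 folds, shared by all candidate models
folds = np.array_split(rng.permutation(n), 5)

def cv_error(degree):
    """Mean squared prediction error of a degree-`degree` polynomial fit,
    averaged over the held-out folds."""
    errs = []
    for fold in folds:
        train = np.setdiff1d(np.arange(n), fold)
        coef = np.polyfit(x[train], y[train], degree)
        pred = np.polyval(coef, x[fold])
        errs.append(np.mean((y[fold] - pred) ** 2))
    return float(np.mean(errs))

errors = {d: cv_error(d) for d in range(1, 7)}
best_degree = min(errors, key=errors.get)
```

The point of the shared fold split is that every candidate model is judged on exactly the same held-out data, so differences in `cv_error` reflect the models, not the randomness of the split.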
Projects
A major part of this course will be the work on five small projects. Doing and presenting the projects is mandatory, since an individual revision of each project will make up 50% of the take-home exam and therefore of your grade.
The course will provide you with an overview of available methods and some examples of how they can be used. The projects, on the other hand, are meant to give you deeper insight into algorithmic assumptions and hands-on experience with data analysis.
General information

Projects will be presented on
 Friday, 12th of April
 Mondays, 6th, 13th, 20th and 27th of May
 You will have at least one week to work on each project.
 Since there are many registered participants in this course, you will have to work in groups. We will provide you with a selection of topics. You will sign up individually for the topic you want to work on, and there will be a maximum number of registrations allowed per topic. Within each topic, we will randomly divide you into groups of 3–4 people. These groups will be different for each project.
 Projects are supposed to result in short presentations (5–10 slides) that you will present to your peers and which will be discussed. Focus on understanding, interpretation and your new insights. Clear and understandable plots are crucial.
 Every group is required to send in its slides and code for the project presentations by 10.00 on presentation days (Friday or Monday).
 Depending on the number of participants in the course, it might not be possible that all groups present each project. If necessary, presenters will be chosen randomly. Note that you have to attend the presentations to make your submitted project count. If your group cannot present due to lack of time, then your project still counts.
 Be a team player! Attendance is recorded individually, so groups are not at a disadvantage if somebody does not show up, but don't leave your team hanging!
 If you have a good, provable reason you are allowed to miss one presentation day, as long as you tell me beforehand.
 You are free in your choice of programming language (R, Python, Matlab, Julia, ...). Keep in mind, however, that R and Python are currently the most widely used languages in data analysis, so they offer the largest choice of available implementations. Additionally, all of us (Felix, Juan, Rebecka) work in R and will be able to help you most if you use that language.

Some resources if you want to learn (more) R for this course:
 An overview of resources to get started is at Getting Started with R.
 A very concise introduction to the main ideas behind the language can be found at R Tutorial.
 For a more complete explanation there is The Art of R and for advanced topics Advanced R.
 Resources for finding datasets (the terms "interesting" and "awesome" below are part of the websites' names, not a value judgement)
Course requirements
The official course-specific prerequisites, as stated in the syllabus, are:
The prerequisites for the course are a basic course in statistical inference and MVE190/MSG500 Linear Statistical Models. Students can also contact the course instructor for permission to take the course.
This means you should be familiar with the following:
 Basic vector calculus and linear algebra (Matrices, vectors, gradients, ...)
 Basic distributions (Normal, Student-t, Gamma, Chi-square, ...)
 Parameter estimation in the framework of maximum likelihood
 Knowledge about least squares methods and their statistical implications
 Linear regression and how to interpret its results
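To make the last two prerequisites concrete: under Gaussian errors, the maximum-likelihood estimate of a linear model coincides with the least-squares fit. The following sketch is purely illustrative (not course material, and in NumPy rather than the R used in the course); all names and numbers are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])      # design matrix with intercept
beta_true = np.array([1.0, 2.0])
y = X @ beta_true + 0.1 * rng.normal(size=n)

# Least-squares estimate: beta_hat = argmin ||y - X beta||^2,
# which is also the ML estimate when the errors are Gaussian
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

residuals = y - X @ beta_hat
# Unbiased estimate of the noise variance (divide by n - p)
sigma2_hat = residuals @ residuals / (n - X.shape[1])
```

Interpreting the result means relating `beta_hat` back to the model (intercept and slope) and checking the residuals, exactly the kind of reasoning assumed from MVE190/MSG500.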
Examination
Your final grade will be based on an individual take-home exam. Question 1 of the exam will be an individual write-up of the five projects, where you can revise and improve the work you did during the course. The projects count for 50% of your final grade, and attendance on presentation days is therefore compulsory. The other questions on the exam will be a set of data analysis tasks, one of which is a "miniproject" on a data set of your own choice.
 You will receive the exam on 24th of May
 The submission deadline is 14th of June, 23.59. No exceptions!
 Exams are individual! The year is 2019 and text comparison is fast and automatic. Please do us (and yourself!) the favour of writing your own submission.
Examination procedures
In Chalmers Student Portal you can read about when exams are given and what rules apply on exams at Chalmers. In addition to that, there is a schedule when exams are given for courses at University of Gothenburg.
Since this course is examined through a take-home exam, you will not have to register before taking it. If you have attended all presentation days, you are automatically admitted to take the exam. You will be informed about your admission towards the end of the course.
After the exam has been graded, you can see your results in Ladok by logging on to your Student portal.
At the annual (regular) examination:
When practical, a separate review is arranged. The date of the review will be announced here on the course homepage. Anyone who cannot participate in the review may thereafter retrieve and review their exam at the Mathematical Sciences Student office. Check that you have the right grades and score. Any complaints about the marking must be submitted in writing at the office, where there is a form to fill out.
At re-examination:
Exams are reviewed and retrieved at the Mathematical Sciences Student office. Check that you have the right grades and score. Any complaints about the marking must be submitted in writing at the office, where there is a form to fill out.