Welcome to MVE440/MSA220 Statistical Learning for Big Data!
Upcoming office hours:
- Monday, 3rd June, 11.45–12.45 with Rebecka in L3056
- Tuesday, 4th June, 9.00–10.00 with Juan in L3098
- Wednesday, 5th June, 13.30–14.30 with Felix in L3098
- Note: There will not be a project 5 this year.
- Interactive website demonstrating some of the caveats of tSNE
- Office hours for week 20 (13th May - 19th May) Tuesday 11-12, Wednesday 13.30-14.30, and Thursday 14.30-15.30 (Changed) in the Mathematical Sciences building, room L3098
- Office hours for week 19 (6th May - 12th May) Tuesday 10.00-12.00 and Wednesday 13.30-15.00 in the Mathematical Sciences building, room L3098. Due to other obligations I cannot offer office hours on Thursday or Friday, but feel free to ask during lecture breaks, after lectures, or by email.
- Office hours for week 18 (29th April - 5th May) Monday 13.30-15 and Thursday 13.30-15 in the Mathematical Sciences building, room L3098
- Lecture slides for lecture 9 (15th April) are finally online with some additional (and improved) figures to hopefully increase understanding of penalization methods.
- Some notes on the linear algebra behind PCA and SVD can be found here.
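The linked notes cover the theory; as a quick numerical illustration (a minimal sketch in Python with NumPy, even though the course otherwise leans towards R), the principal component scores of a centred data matrix can be read off directly from its SVD:

```python
import numpy as np

# Toy data: 6 samples, 3 features
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))

# Centre each column (PCA operates on centred data)
Xc = X - X.mean(axis=0)

# SVD of the centred data matrix: Xc = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# The rows of Vt are the principal directions (loadings),
# and U * s gives the principal component scores.
scores_svd = U * s

# Cross-check: the scores are also obtained by projecting
# the centred data onto the loadings.
scores_proj = Xc @ Vt.T

assert np.allclose(scores_svd, scores_proj)
```

The two computations agree because Xc @ Vt.T = U @ diag(s) @ Vt @ Vt.T = U @ diag(s), which is the same identity the notes derive.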
- Office hours for week 15 (8-12th April) Monday 13.30-15 and Wednesday 13.30-15 in the Mathematical Sciences building, room L3098
The student representatives for this course are
- MPCAS Joakim Jansson JOAJANS
- MPCAS Lars Jansson LARJANS
- MPSYS Sondre Chanon Wiersdalen CHANON
- MPENM Klas Holmgren KLASHO
- The course starts on Monday, 25th March 2019, 15.15, in lecture hall HA3, Hörsalsvägen.
- The schedule for the course can be found in TimeEdit.
- For basic information on the course and the learning goals check out the syllabus for GU (in Swedish) and Chalmers (in English).
Course coordinator: Felix Held (email@example.com)
Examinator: Rebecka Jörnsten (firstname.lastname@example.org)
Teaching assistant: Juan Inda Diaz (email@example.com)
Project presentations will be handled by all three of us.
Current office hours are listed in the News section.
Most of the course content can be found in
Hastie, T, Tibshirani, R, and Friedman, J (2009) The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd ed. New York: Springer Science+Business Media, LLC
This book is freely available online. When I refer to ESL, I mean this book.
Other helpful books
Books that have a more practical angle than ESL:
- James, G, Witten, D, Hastie, T, and Tibshirani, R (2013) An Introduction to Statistical Learning: With Applications in R. New York: Springer Science+Business Media, LLC
  The little sibling of ESL, freely available online.
- Kuhn, M and Johnson, K (2013) Applied Predictive Modeling. New York: Springer Science+Business Media, LLC
For a theoretical angle on the more traditional (small to medium data) parts of the course:
- Falk, M, Marohn, F, and Tewes, B (2002) Foundations of Statistical Analyses and Applications with SAS. Basel: Birkhäuser
- Bishop, CM (2006) Pattern Recognition and Machine Learning. New York: Springer
  A very pedagogic book with a (mostly) Bayesian angle
- Murphy, KP (2012) Machine Learning: A Probabilistic Perspective. Cambridge, Massachusetts: The MIT Press
  A compendium with an even stronger focus on the Bayesian angle. Tougher for learning but great for reference
This is a preliminary lecture plan. It will be updated after each lecture to reflect reality. In addition, I'll link to lecture slides, blackboard notes (at least the ones that were planned) as well as the code to recreate the plots or numerical results presented in the lecture.
|Week||Day||Reading||Topics||Material|
|13||Monday||ESL Ch. 1 and 2||Course Overview, Introduction to Big Data and Statistical Learning||Lecture 1: Slides (post-lecture), Notes, Code|
|13||Thursday||ESL Ch. 4 (Sec. 4.1–4.3; before 4.3.1, Sec. 4.4 Intro + 4.4.2)||Classification: 0-1 regression; Logistic, probit and softmax regression; nearest centroids; linear, quadratic and diagonal discriminant analysis||Lecture 2: Slides (post-lecture), Notes, Code|
|13||Friday||ESL Ch. 7 (Sec. 7.1–7.5, 7.10)||Model Assessment for Predictive Learning; Model Selection through Cross-Validation||Lecture 3: Slides (post-lecture), Notes, Code|
|14||Monday||ESL Sec. 9.2, 8.7, 15.1–15.3||Classification and Regression Trees (CART); Random Forests||Lecture 4: Slides (post-lecture), Code|
|14||Thursday||ESL Sec. 15.3.1–15.3.2; 4.3.1||Examples of Random Forests & Variable Importance; Singular Value Decomposition; Principal Component Analysis; Regularized Discriminant Analysis||Lecture 5: Slides (post-lecture), Code, Additional theory (optional)|
|14||Friday||ESL Sec. 4.3.3; 14.3–14.3.6; Blogpost on caveats of k-means||Fisher's LDA; Introduction to Clustering; Combinatorial Clustering; k-means||Lecture 6: Slides (post-lecture), Code|
|15||Monday||ESL Sec. 14.3.7–8, 14.3.10–12||k-medoids/partition around medoids; Selection of Cluster Count; Hierarchical Clustering; Gaussian Mixture Models||Lecture 7: Slides (post-lecture), Code|
|15||Thursday||ESL Sec. 8.5, 12.7||Expectation Maximization and Clustering; Mixture Discriminant Analysis; Density-based clustering/DBSCAN||Lecture 8: Slides (post-lecture), Code|
|15||Friday||Project Presentations 1|
|16||Monday||ESL Ch. 3.4; Sec 3.8.4–5||Penalized regression methods: Regularization and Variable selection (Ridge Regression, Lasso)||Lecture 9: Slides (post-lecture), Code|
|18||Thursday||ESL Sec. 3.8.4, 3.8.6; Ch. 18 up to and including 18.3.1; Sec. 18.4||Penalized classification: Nearest Shrunken Centroids; Computational aspects of the lasso; Elastic Net; Group Lasso||Lecture 10: Slides (post-lecture), Code|
|18||Friday||ESL Sec. 3.8.5–3.8.6; 14.5.1 (up to and including the handwritten digits example); 14.6; 14.7.1||Penalized regression: Oracle estimators; adaptive lasso; SCAD; sparse logistic regression; Data representations: SVD Revisited; Factor analysis; Non-negative Matrix Factorization||Lecture 11: Slides (post-lecture), Code|
|19||Monday||Project Presentations 2|
|19||Thursday||ESL Sec. 14.5.4; 14.6||Data representations: Non-negative Matrix Factorization (cont'd); Intro to kernels and kernel-methods; kernel-PCA||Lecture 12: Slides (post-lecture), Code|
|19||Friday||ESL Sec. 14.8; 14.9 (Intro + Isomap); Interactive demonstration and caveats of tSNE||Data representations: Multi-dimensional scaling, Isomap, tSNE||Lecture 13: Slides (post-lecture), Code|
|20||Monday||Project Presentations 3|
|20||Friday||ESL Sec. 14.5.3; 17.3||High-dimensional clustering: Subspace clustering; Spectral clustering; Graphical Lasso||Lecture 14: Slides (post-lecture), Code|
|21||Monday||Project Presentations 4|
|21||Thursday||Large sample methods: Randomized Projection; Randomized SVD; Divide and Conquer; Random Forests for big-n; m-out-of-n bootstrap; bag of little bootstraps; leveraging||Lecture 15: Slides (post-lecture), Code|
|21||Friday||Review||Lecture 16: Slides (post-lecture)|
A major part of this course will be the work on five small projects. Doing and presenting the projects is mandatory since an individual revision of each project will be 50% of the take-home exam and therefore of your grade.
The course will provide you with an overview of available methods and some examples of how they can be used. The projects on the other hand are meant to give you a deeper insight into algorithmic assumptions and hands-on experience of data analysis.
Projects will be presented on
- Friday, 12th of April
- Mondays, 6th, 13th, 20th and 27th of May
- You will have at least one week to work on each project.
- Since there are many registered participants in this course, you will have to work in groups. We will provide a selection of topics; you sign up individually for the topic you want to work on, with a maximum number of registrations allowed per topic. Within each topic, we will randomly divide you into groups of 3-4 people. Groups will be different for each project.
- Projects are supposed to result in short presentations (5-10 slides) that you will present to your peers and which will be discussed. Focus on understanding, interpretation and your new insights. Clear and understandable plots are crucial.
- Every group is required to send in their slides and code for the project presentations by 10.00 on presentation days (Friday or Monday).
- Depending on the number of participants in the course, it might not be possible for all groups to present each project. If necessary, presenters will be chosen randomly. Note that you have to attend the presentations for your submitted project to count. If your group cannot present due to lack of time, your project still counts.
- Be a team player! Attendance is recorded individually, so a group is not at a disadvantage if somebody does not show up, but don't leave your team hanging!
- If you have a good, verifiable reason you may miss one presentation day, as long as you tell me beforehand.
- You are free in your choice of programming language (R, Python, Matlab, Julia, ...). Consider, however, that R and Python are currently the most used languages in data analysis and offer the widest choice of available implementations. Additionally, all of us (Felix, Juan, Rebecka) work in R and will be able to help you most if you use that language.
- Some resources if you want to learn (more) R for this course:
- Resources for finding datasets (the terms interesting or awesome below are part of the names of the websites, not a value judgement)
The official course-specific prerequisites, as stated in the syllabus, are:
The prerequisites for the course are a basic course in statistical inference and MVE190/MSG500 Linear Statistical Models. Students can also contact the course instructor for permission to take the course.
This means you should be familiar with the following:
- Basic vector calculus and linear algebra (Matrices, vectors, gradients, ...)
- Basic distributions (Normal, Student-t, Gamma, Chi-Square, ...)
- Parameter estimation in the framework of maximum likelihood
- Knowledge about least squares methods and their statistical implications
- Linear regression and how to interpret its results
Your final grade will be based on an individual take-home exam. Question 1 of the final will be an individual write-up of the five projects where you can revise and improve the work you did during the course. The projects count for 50% of your final grade and attendance on presentation days is therefore compulsory. The other questions on the final will be a set of data analysis tasks, one of which is a "mini-project" on a data set of your own choice.
- You will receive the exam on 24th of May
- The submission deadline is on 14th of June, 23.59. No exceptions!
- Exams are individual! The year is 2019 and text comparison is fast and automatic. Please do us (and yourself!) the favour and write your own submission.
In the Chalmers Student Portal you can read about when exams are given and what rules apply to exams at Chalmers. In addition, there is a schedule of when exams are given for courses at the University of Gothenburg.
Since this course is examined through a take-home exam you will not have to register before taking it. If you have attended all presentation days then you are automatically admitted to take the exam. You will be informed about your admission towards the end of the course.
After the exam has been graded, you can see your results in Ladok by logging on to your Student portal.
At the annual (regular) examination:
When it is practical, a separate review is arranged. The date of the review will be announced here on the course homepage. Anyone who can not participate in the review may thereafter retrieve and review their exam at the Mathematical Sciences Student office. Check that you have the right grades and score. Any complaints about the marking must be submitted in writing at the office, where there is a form to fill out.
At re-exams:
Exams are reviewed and retrieved at the Mathematical Sciences Student office. Check that you have the right grades and score. Any complaints about the marking must be submitted in writing at the office, where there is a form to fill out.