MSA220, Statistical Learning for Big Data, Spring 18

Latest news

  • Exam 2018 due June 8th.
  • I am traveling next week but will try to reply to email questions.
  • Office hours Monday June 4th 15-17.
  • Better version of RMOA on GitHub. However, it appears to still have some bugs in it, so use with caution.

  • People have asked me about the final. Here's some info. There will be 4 questions. The first question is a revisit of Minis 1-6: after the class is over you have an opportunity to redo them with the additional knowledge you've acquired over the course of the semester. Questions 2-4 will be data analysis tasks on a data set I pick for you. I will use real data and "manipulate" it in a fashion unknown to you. You will be asked to perform specific analysis tasks such as model selection, classification, etc. I will hand out the final during the last week of classes and you will have 2 weeks to complete it.

  • Student representatives:


    Course coordinator: Rebecka Jörnsten

    Office hours: Mondays 14-15, Thursdays 9-10 in MVH3029 (starting 2nd week of classes)


    Course literature

    The Elements of Statistical Learning, Hastie, T., Tibshirani, R., and Friedman, J.

    Weblink to the book.
    We will also use journal papers and other materials. These will be posted under "Programme".

    Recommended texts include:

  • "Statistics for High-Dimensional Data: Methods, Theory and Applications", Springer 2011, P. Bühlmann and S. van de Geer.
  • "Handbook of Big Data", Chapman and Hall/CRC, 2016, P. Bühlmann, P. Drineas, M. Kane, and M. van der Laan, editors.
  • An Introduction to Statistical Learning, Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani.



    Topics, chapters, and lecture materials

    Introduction + Clustering

    Chapters 2.1-2.3. Skim Chapters 8.7, 9.2, 15

    Chapters 14.1-14.3 + journal papers

    The Parable of Google Flu: Traps in Big Data Analysis, Science, Vol. 343, 14 March 2014, David Lazer, Ryan Kennedy, Gary King, Alessandro Vespignani
    Statistical Inference, Learning and Models in Big Data, Franke et al., 2016
    Lecture 1, Lecture 1 - R code, puppy.txt

    Lecture 2, Lecture 2 - R code
    Data clustering: 50 years beyond k-means, Anil K. Jain, Pattern Recognition Letters, 31 (2010), 651–666
    Mini 1: due Thursday April 12th. Sign up via the doodle - check the Lecture 2 notes for the link.


    2.1-2.7, 3.1-3.8, 4.1-4.4, 7.1-7.10, 13.3
    Lecture 3, Lecture 3 - R code
    Here's the paper that explains the various indices in the NbClust package.
    Lecture 4, Lecture 4 - R code, wine data, Caret paper, Caret slides
    Lecture 5, Lecture 5 - R code

    High-dimensional modeling

    3.8, 18.2-18.4, 18.6
    Lecture 6, Lecture 6 - R code, more R code, sparse LDA paper
    Lecture 7-8, R code
    Review paper on high-dimensional DA, Review paper on feature selection

    Data representations: PCA, Factor Analysis, NMF...

    14.4-14.9, Journal papers
    Lecture 9, R code, HDI paper,
    Lecture 10, Lecture 11, R code, more R code
    sparse SVD paper, NMF paper generalized to structured sparsity
    Lecture 12, R code
    Nonlinear dimension reduction - review paper, LLE paper
    Great DimRed review, DimRed tutorial

    Clustering revisited

    14.1-14.3 + journal papers
    Lecture 13, Mclust R code, Codes for consensus clustering etc.: More R code, Even more R code, Spectral clustering R code, Graphical lasso R code

    Journal papers: Model-based clustering, Variable selection, The HDclassif package, The high-dim class paper, High-dim clustering
    Subspace clustering, Spectral clustering, Consensus clustering
    TCGAdata.RData: TCGA data and class labels (load with load("TCGAdata.RData"))
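    Loading the .RData file can be sketched as follows. This is a minimal example: load() returns the names of the restored objects, so nothing about the object names inside the file is assumed here; they are discovered with ls()-style inspection.

```r
# Restore the saved objects into the workspace; load() returns
# the names of the objects it restored, which depend on what
# was stored in the file.
loaded <- load("TCGAdata.RData")
print(loaded)

# Inspect each restored object (class, dimensions, a peek at the values)
for (nm in loaded) str(get(nm))
```

    If the file contains a data matrix and a label vector, a quick table() of the labels is a reasonable first check before clustering or classification.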

    Big n.
    Lecture notes and Journal papers
    Lecture 14, Bootstrap R code, BLB and leverage code, KC house price data (csv file)
    Journal papers: Statistical methods and computing for big data
    Bag of Little Bootstraps, Leveraging

    Journal papers for online learning/Mini 6
    DataStream Classification paper, DataStream clustering paper

    Big data versions of RF, Variants of decision trees, Bagging methods for concept drift, Online bagging paper.
    R package that includes these online or chunk-based classification methods: RMOA (with poor documentation!). Here is the documentation for the Java version:
    MOA options. Scroll down to see which tuning parameters each method uses.
    Links to MOA information: Code examples, List of methods
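    As a starting point, training one of these online classifiers could look roughly like the sketch below. It follows the HoeffdingTree/datastream_dataframe/trainMOA interface shown in the RMOA README; the iris data and the chunk size are placeholders, and the tuning options correspond to the MOA options linked above.

```r
library(RMOA)

# Set up a Hoeffding tree, one of the online classifiers wrapped by RMOA.
hdt <- HoeffdingTree(numericEstimator = "GaussianNumericAttributeClassObserver")

# RMOA consumes data as a stream: datastream_dataframe() wraps an ordinary
# data frame so observations are fed to the learner in sequential chunks.
stream <- datastream_dataframe(data = iris)

# Train incrementally, 30 observations at a time.
model <- trainMOA(model = hdt,
                  formula = Species ~ Sepal.Length + Sepal.Width +
                                      Petal.Length + Petal.Width,
                  data = stream, chunksize = 30)

# Score new data in the usual predict() fashion and cross-tabulate
# predictions against the true labels.
scores <- predict(model, newdata = iris, type = "response")
table(scores, iris$Species)
```

    For concept-drift experiments, swap the Hoeffding tree for one of the bagging/chunk-based learners listed in the MOA methods link and compare accuracy across chunks.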

    D-stream clustering method, Clustering stream data R package

    Lecture 15 , RMOA, Stream

    Lecture 16, Sullivan and Feinn: P-values and Effect Size, A. Gelman: Induction and Deduction, A. Gelman: P-values and Statistical Practice
    B. Efron: A 250-year argument
    Raftery et al., Bayesian Model Averaging, Park and Casella: Bayesian Lasso
    Review


    There will be 6 Mini-Analysis projects. You can work in pairs on these, but not with the same partner every time. If you prefer to work on your own, that is fine too.
    You have to hand in slides and be prepared to present your results in class. Mini-Analyses are compulsory. You have to present at least 2 projects, and I will randomly choose presenters each time. Mondays are Mini-Analysis days.

    Your final grade will be based on a take-home final. Question 1 of the final will be an individual write-up of the 6 Mini-Analyses, where you can revise and improve the work you did during the course. The Minis count for 50% of your final grade and are compulsory. The other questions on the final will be a set of data analysis tasks, one of which is a "mini-project" on a data set of your own choice.

    Course requirements

    The learning goals of the course can be found in the course plan.



    Examination procedures

    In the Chalmers Student Portal you can read about when exams are given and what rules apply to exams at Chalmers. In addition, there is a schedule of when exams are given for courses at the University of Gothenburg.

    Before the exam, it is important that you sign up for the examination. If you study at Chalmers, you do this via the Chalmers Student Portal; if you study at the University of Gothenburg, you sign up via GU's Student Portal, where you can also read about what rules apply to examinations at the University of Gothenburg.

    At the exam, you should be able to show valid identification.

    After the exam has been graded, you can see your results in Ladok by logging on to your Student portal.

    At the annual (regular) examination:
    When it is practical, a separate review is arranged. The date of the review will be announced here on the course homepage. Anyone who cannot participate in the review may afterwards retrieve and review their exam at the Mathematical Sciences student office. Check that you have the right grade and score. Any complaints about the marking must be submitted in writing at the office, where there is a form to fill out.

    At re-examination:
    Exams are reviewed and retrieved at the Mathematical Sciences student office. Check that you have the right grade and score. Any complaints about the marking must be submitted in writing at the office, where there is a form to fill out.

    Old exams