DATA

The assignment consists of analysing the data collected in 2003 and 2004 from the students taking the course. Click here to get the data in the S-PLUS format.
The data set presents the following personal information anonymously:
  • S = student's sex (F/M)
  • C = student's original hair color, 1 for light, 2 for brown, 3 for black
  • H = student's height in cm
  • W = student's weight in kg

The assignment is to be done using S-PLUS. These links might be helpful when doing the assignment.

    Assignments

    1. Summarize the data using summary statistics.

    2. Is there a relationship between a person's sex (S) and hair color (C)? Put together a 2x3 contingency table reflecting the joint distribution of the two factors. State an appropriate null hypothesis and test it at the 5% significance level. What is the P-value of the test?

    3. Make normal probability plots of the weights and of the heights. What conclusions do you draw from them?
     
    4. Fit a straight line to the scatterplot of weights versus heights. What do you conclude about the relationship between them? Give an appropriate measure of dependence between a person's weight and height.

    5. Estimate from the data the population means for the weight and the height. What are the standard errors of these estimates?

    6. Present the results of your analysis in a nice readable form.
     
     

    Help with S-PLUS

    Data

  • Importing your ASCII file into S-PLUS:
        data <- importData("your_file.txt",colNameRow=1)
        # Here your data is saved in S-PLUS as data; the first row of your
        # ASCII file is used to name the columns
        # <- (and also _) assigns a name to an object
        
        
  • The filter argument to importData allows you to subset the data you import
        data.females <- importData("your_file.txt", colNameRow=1, filter="sex = 1")
        # Here only the data for females is imported
        
        
  • To have a look at the data, type the name of the data
         data.females 
        
        
  • To only select one column or row
        
        data[,n]
        # Here the nth column is selected
    
        data[n,]
        # Here the nth row is selected
    
        height <- data[,4]
        # Here a vector called height is created (the 4th column is saved
        # as a vector called height)
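
    The selection commands above can be tried without an input file. The following is a minimal sketch that should run in S-PLUS or R. The data frame students and all of its values are made up for illustration; they are not the course data, and the column order S, C, H, W is an assumption taken from the variable list at the top of this page.

```r
# Hypothetical values for illustration only; these are NOT the course data
students <- data.frame(
    S = c("F", "M", "F", "M", "M", "F"),
    C = c(1, 2, 2, 3, 1, 2),
    H = c(165, 180, 170, 175, 183, 160),
    W = c(58, 82, 63, 70, 85, 52)
)

# Select a column by position (H is assumed to be the 3rd column here)
height <- students[, 3]

# Select a column by name instead of by position
weight <- students[, "W"]

# Select only the rows where S equals "F"
females <- students[students[, "S"] == "F", ]
```

    Selecting by column name is less error-prone than selecting by position, because it does not depend on the order of the columns in your file.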
        
        
        

    Tables and plots

  • A simple table
        table(x,y)
        # Here x and y are the variables you want to tabulate
        
        
  • A contingency table
        crosstabs(~x+y)
        
        
  • A simple scatter plot
        plot(x,y, xlab="The label of x-axis", ylab="The label of y-axis")
        title("A scatter plot of x and y")
    
        # It is possible to add lines and points to the plot (z and w are
        # the coordinates of the lines or the points)
        lines(z,w)
        points(z,w)
    
        
        
  • A histogram
        hist(x)
        
        
  • A normal quantile-quantile plot (normal probability plot)
        qqnorm(x)	
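
    As an illustration of the tabulation command in this section, here is a small sketch that should run in S-PLUS or R. The vectors sex and hair are made-up values, not the course data. It shows that tabulating a two-level factor against a three-level factor gives a 2x3 table of counts.

```r
# Hypothetical values for illustration only; these are NOT the course data
sex  <- c("F", "M", "F", "M", "M", "F", "M", "F")
hair <- c(1, 2, 2, 3, 1, 2, 3, 1)

# Cross-tabulate the two variables: rows are sexes, columns are hair colors
tab <- table(sex, hair)
tab

# The table has one row per level of sex and one column per level of hair
dim(tab)
```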
        
        

    Summary statistics and tests

  • Mean, variance, standard deviation, median, minimum, maximum
        mean(x)
        var(x)  # The sample variance
        stdev(x) # The sample standard deviation
        median(x)
        min(x)
        max(x)
        
        
  • A sample correlation coefficient
        cor(x,y)	
        
        
  • A Pearson's chi-square test on a two-dimensional contingency table
        chisq.test(x,y) 
        # Here x and y are the variables whose relationship you are interested in
        
        
  • Fitting a linear regression model (least squares is the default fitting method)
        lm(y ~ x)
        # Here x is the independent variable and y is the dependent variable
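
    The commands in this section can be combined as in the following sketch, which should run in S-PLUS or R. All numbers are made up for illustration; they are not the course data. The standard error of a sample mean is computed here as s/sqrt(n), which is not a built-in command listed above. The names se.height, fit, counts and test are chosen only for this example.

```r
# Hypothetical values for illustration only; these are NOT the course data
height <- c(165, 180, 170, 175, 183, 160, 178, 172)   # cm
weight <- c(58, 82, 63, 70, 85, 52, 77, 66)           # kg

# Estimate of the population mean and its standard error, s / sqrt(n)
n <- length(height)
mean(height)
se.height <- sqrt(var(height) / n)

# Least squares line: weight = a + b * height
fit <- lm(weight ~ height)
coef(fit)                  # intercept a and slope b

# Sample correlation coefficient as a measure of linear dependence
r <- cor(height, weight)

# Pearson's chi-square test applied directly to a 2x3 table of counts
# (hypothetical counts, rows = sex, columns = hair color)
counts <- matrix(c(12, 20, 8,
                   15, 18, 7), nrow = 2, byrow = TRUE)
test <- chisq.test(counts)
test$p.value               # compare with the 0.05 significance level
```

    The null hypothesis of independence is rejected at the 5% level only if the P-value is below 0.05.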