## DATA

The assignment consists of analysing the data collected in 2003 and 2004 from the students taking the course. The data are also available in S-PLUS format.
The data set records the following personal information anonymously:
• S = student's sex (F/M)
• C = student's original hair color, 1 for light, 2 for brown, 3 for black
• H = student's height in cm
• W = student's weight in kg

The assignment is to be done using S-PLUS. The S-PLUS help notes below might be useful when doing the assignment.

### Assignments

1. Summarize the data using summary statistics.

2. Is there a relationship between a person's sex (S) and hair color (C)? Put together a 2x3 contingency table reflecting the joint distribution of the two factors. Set up an appropriate null hypothesis and test it at the 5% significance level. What is the P-value of the test?

3. Make normal probability plots of the weights and of the heights, and draw your conclusions from them.

4. Fit a straight line to the scatterplot of weights vs heights. What is your conclusion about the relationship between them? Give an appropriate measure of dependence between the weight and height of a person.

5. Estimate from the data the population means for the weight and the height. What are the standard errors of these estimates?

6. Present the results of your analysis in a nice readable form.

### Help with S-PLUS

#### Data

• Importing your ASCII file to S-PLUS:
```
data <- importData("your_file.txt", colNameRow=1)
# Here your data is saved in S-PLUS as data; the first row of your
# ASCII file is used to name the columns
# <- (and also _) assigns a name to an object
```
• The filter argument to importData lets you import only a subset of the data:
```
data.females <- importData("your_file.txt", colNameRow=1, filter="sex = 1")
# Here only the rows for females are imported (this assumes a column
# named sex in which females are coded as 1)
```
• To have a look at the data, type the name of the data object:
```
data.females
```
• To select a single column or row:
```
data[,n]
# Here the nth column is selected

data[n,]
# Here the nth row is selected

height <- data[,4]
# Here the 4th column is saved as a vector called height
```

#### Tables and plots

• A simple table
```
table(x,y)
# Here x and y are the variables you want to tabulate
```
• A contingency table
```
crosstabs(~x+y)
```
• A simple scatter plot
```
plot(x,y, xlab="The label of x-axis", ylab="The label of y-axis")
title("A scatter plot of x and y")

# It is possible to add lines and points to the plot (z and w are
# the coordinates of the lines or the points)
lines(z,w)
points(z,w)
```
• A histogram
```
hist(x)
```
• A normal quantile-quantile plot (normal probability plot)
```
qqnorm(x)
```
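As a reminder of what a normal probability plot shows, here is a brief sketch. The notation ($x_{(i)}$, $p_i$, $\Phi^{-1}$) is introduced here, and the plotting positions $p_i$ below are only one common choice; the positions actually used by `qqnorm` may differ slightly.

$$
\left(\Phi^{-1}(p_i),\; x_{(i)}\right), \qquad p_i = \frac{i - 1/2}{n}, \qquad i = 1, \dots, n,
$$

where $x_{(1)} \le \dots \le x_{(n)}$ are the sorted observations and $\Phi^{-1}$ is the quantile function of the standard normal distribution. If the data are roughly normal with mean $\mu$ and standard deviation $\sigma$, the points fall close to the straight line with intercept $\mu$ and slope $\sigma$; systematic curvature suggests a departure from normality.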

#### Summary statistics and tests

• Mean, variance, standard deviation, median, minimum, maximum
```
mean(x)
var(x)    # The sample variance
stdev(x)  # The sample standard deviation
median(x)
min(x)
max(x)
```
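These functions are enough to obtain the standard error of a sample mean. As a reminder, assuming the $n$ observations $x_1, \dots, x_n$ can be treated as an independent random sample (the symbols $n$, $\bar{x}$ and $s$ are notation introduced here, not names from the data file):

$$
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad
s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2, \qquad
\widehat{\mathrm{SE}}(\bar{x}) = \frac{s}{\sqrt{n}} .
$$

Here $s$ is the value returned by `stdev(x)`, and $n$ can be obtained with `length(x)`, provided `x` contains no missing values.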
• A sample correlation coefficient
```
cor(x,y)
```
• A Pearson's chi-square test on a two-dimensional contingency table
```
chisq.test(x,y)
# Here x and y are the variables whose relationship you are interested in
```
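For reference, the statistic behind Pearson's chi-square test of independence can be written in the standard textbook form below. The notation is introduced here: $O_{ij}$ is the observed count in row $i$ and column $j$ of an $r \times c$ table, $O_{i\cdot}$ and $O_{\cdot j}$ are the row and column totals, and $n$ is the total count.

$$
E_{ij} = \frac{O_{i\cdot}\, O_{\cdot j}}{n}, \qquad
X^2 = \sum_{i=1}^{r}\sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} .
$$

Under the null hypothesis of independence, $X^2$ is approximately chi-square distributed with $(r-1)(c-1)$ degrees of freedom, which gives 2 degrees of freedom for a 2x3 table. The approximation is usually considered reliable when the expected counts $E_{ij}$ are not too small; a common rule of thumb is at least about 5 in each cell.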
• Fitting a linear regression model (least squares is the default fitting method)
```
lm(y ~ x)
# Here x is the independent variable and y is the dependent variable
```
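With a single independent variable, the least squares line has a simple closed form, sketched here for reference. The symbols $\hat{a}$ (intercept), $\hat{b}$ (slope), $s_x$ and $s_y$ (sample standard deviations) and $r$ (sample correlation) are notation introduced here.

$$
\hat{b} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad
\hat{a} = \bar{y} - \hat{b}\,\bar{x}, \qquad
r = \hat{b}\,\frac{s_x}{s_y} .
$$

The last identity links the fitted slope to the value returned by `cor(x,y)`: the slope and the correlation always have the same sign, and $r^2$ is the proportion of the variance of $y$ explained by the fitted line.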