(Done using R Markdown)
1) Read the data and list the first five lines
x = read.csv("http://www.ic.unicamp.br/%7Ewainer/cursos/1s2014/dados1.csv",
sep = "\t")
head(x, 5)
## A B C D
## 1 5.1 3.5 1.4 0.2
## 2 4.9 3.0 1.4 0.2
## 3 4.7 3.2 1.3 0.2
## 4 4.6 3.1 1.5 0.2
## 5 5.0 3.6 1.4 0.2
2) Which rows have missing values? Remove them
x[apply(is.na(x), 1, any), ]
## A B C D
## 35 4.9 3.1 NA 0.2
## 49 5.3 NA 1.5 0.2
x = na.omit(x)
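The same check can also be written with `complete.cases`, and `colSums(is.na(...))` gives a per-column count of missing values. A minimal sketch on a made-up data frame (`toy` and its values are assumptions for illustration, not the course data):

```r
# Toy data frame standing in for the real data (values are made up)
toy <- data.frame(A = c(5.1, 4.9, 5.3),
                  B = c(3.5, 3.1, NA),
                  C = c(1.4, NA, 1.5))

# Number of missing values in each column
na_per_col <- colSums(is.na(toy))

# Rows with at least one missing value
incomplete <- toy[!complete.cases(toy), ]

# Keep only the complete rows; this selects the same rows as na.omit(toy)
toy_clean <- toy[complete.cases(toy), ]
```

Looking at `na_per_col` first is a cheap way to see whether the missing values are concentrated in one attribute before dropping rows.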
3) What are the outliers? I used a boxplot to see the distribution of values in each attribute.
boxplot(x)
The boxplot suggests that the extreme values in A and C (one in each) are outliers. I also checked D, which has one negative value while all the others are positive.
hist(x$D)
I would consider that an outlier too. So the outliers are:
x[x$A > 10 | x$C > 10 | x$D < 0, ]
## A B C D
## 58 4.9 2.4 12.4 1.0
## 89 15.6 3.0 4.1 1.3
## 90 5.5 2.5 4.0 -1.3
Remove them:
x = x[!(x$A > 10 | x$C > 10 | x$D < 0), ]
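The thresholds above were chosen by eye. As a rough alternative, the points that a boxplot draws individually can be extracted with `boxplot.stats`, which flags values more than 1.5 times the hinge spread beyond the hinges. A sketch on a made-up vector (`v` is an assumption, not a column of the course data):

```r
# Made-up values with one obvious extreme (not the course data)
v <- c(4.9, 5.0, 5.1, 5.3, 5.5, 5.6, 5.8, 6.0, 15.6)

# boxplot.stats() returns in $out the points lying beyond the whiskers
flagged <- boxplot.stats(v)$out

# Logical mask that keeps everything that was not flagged
keep <- !(v %in% flagged)
v_clean <- v[keep]
```

This rule only looks at how far a value is from the bulk of the data, so it may not flag a value that is implausible for other reasons, such as a negative measurement that is not far from the rest. That kind of outlier still needs a check like the histogram of D.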
4) Histograms
hist(x$A, 10)
hist(x$A, 30)
The histogram with 10 bins shows that the distribution of A is unimodal, while the histogram with 30 bins is much less informative: its peaks and valleys are just “noise”.
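Note that when `breaks` is a single number, `hist` treats it as a suggestion and rounds the break points to "pretty" values, so the actual number of bins can differ from the one requested. A sketch of how to force an exact number of bins by passing explicit break points (the sample `v` is simulated, not the course data):

```r
set.seed(1)
v <- rnorm(100, mean = 5.8, sd = 0.8)   # simulated sample, for illustration only

# A single number is only a suggestion for the number of bins
h_suggested <- hist(v, breaks = 30, plot = FALSE)

# Explicit break points give exactly 30 bins covering the whole range
edges <- seq(floor(min(v)), ceiling(max(v)), length.out = 31)
h_exact <- hist(v, breaks = edges, plot = FALSE)

# Compare the number of bins actually used in each case
n_suggested <- length(h_suggested$counts)
n_exact <- length(h_exact$counts)
```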
5) Covariance matrix
print(cov(x), digits = 2)
## A B C D
## A 0.693 -0.047 1.29 0.52
## B -0.047 0.188 -0.33 -0.12
## C 1.293 -0.331 3.15 1.31
## D 0.523 -0.122 1.31 0.59
There is no need to print numbers with all those digits just to get a sense of the data!
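Covariances depend on the scale of each attribute, so it can be easier to compare pairs after rescaling to correlations. A small sketch using `cov2cor` on made-up data (`toy` is an assumption, not the course data):

```r
# Made-up two-column data frame (not the course data)
toy <- data.frame(A = c(5.1, 4.9, 6.3, 5.8, 7.1),
                  C = c(1.4, 1.4, 4.9, 4.0, 5.9))

S <- cov(toy)      # covariance matrix
R <- cov2cor(S)    # same information rescaled so the diagonal is 1

print(R, digits = 2)
```

The result is the same matrix that `cor(toy)` would return.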
6) PCA
x.pca = princomp(x)
summary(x.pca)
## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4
## Standard deviation 2.0607 0.48597 0.28276 0.155937
## Proportion of Variance 0.9258 0.05149 0.01743 0.005301
## Cumulative Proportion 0.9258 0.97727 0.99470 1.000000
plot(x.pca$sdev)
There are several rules of thumb for how many dimensions to keep, based on the cumulative explained variance. In this case, keeping only one dimension seems to be the right choice, since the first component alone explains about 93% of the variance. Read more at http://stackoverflow.com/questions/12067446/how-many-principal-components-to-take
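The proportions reported by `summary` can be recomputed from `sdev`, which makes it easy to apply a cutoff on the cumulative explained variance. A sketch on simulated data, where the 90% threshold and the data frame `toy` are assumptions chosen for illustration:

```r
set.seed(42)
z <- rnorm(50)
# Simulated data: A and B share one underlying factor, C is mostly noise
toy <- data.frame(A = z + rnorm(50, sd = 0.2),
                  B = 2 * z + rnorm(50, sd = 0.2),
                  C = rnorm(50, sd = 0.3))

toy.pca <- princomp(toy)

# Proportion of variance per component, and its running total
var_explained <- toy.pca$sdev^2 / sum(toy.pca$sdev^2)
cum_var <- cumsum(var_explained)

# Smallest number of components whose cumulative proportion reaches 90%
# (the threshold is an arbitrary choice, not a rule from the text)
k <- which(cum_var >= 0.90)[1]
```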
7) X-Y plot of the two largest dimensions of the PCA
plot(x.pca$scores[, 1:2])
or better, with the same scale on both axes, which shows that most of the variation is in the first dimension, as discussed above:
plot(x.pca$scores[, 1:2], asp = 1)