Exercise 2

ATTENTION - changed the PCA part - 21/3

Due date: 25/3

The zip file train17.zip contains a collection of PGM images of handwritten digits 1 and 7. Each image has 64x64 pixels, and each pixel has value 0 or 1. Each image file has a name in the format X_yyy.BMP.inv.pgm, where X is the digit represented in the image.

The file test17.zip contains test images in the same format.

PGM files start with 3 header lines (P2, then 64 64, then 1), which are not relevant to us, followed by the 64x64 pixel values separated by blanks or line breaks. For us, these 64x64 = 4096 pixels are the attributes/dimensions of the data. The class of each file/data point is the digit given in the file name.
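
The file layout described above can be read with a few lines of R. This is only a sketch: the function names read_pgm, label_of and read_dir are made up for illustration, and it assumes the zip files were already extracted to a directory and that the headers contain no comment lines.

```r
# Read one ASCII (P2) PGM file and return its pixels as a numeric vector.
read_pgm <- function(path) {
  # scan() splits on any whitespace, so blanks and line breaks are treated alike
  tokens <- scan(path, what = character(), quiet = TRUE)
  # The first four tokens are the header: "P2", width, height, max value
  as.numeric(tokens[-(1:4)])
}

# The class is the first character of the file name (X in X_yyy.BMP.inv.pgm).
label_of <- function(path) substr(basename(path), 1, 1)

# Read every .pgm file in a directory: one row per image, one column per pixel.
read_dir <- function(dir) {
  files <- list.files(dir, pattern = "\\.pgm$", full.names = TRUE)
  list(x = t(sapply(files, read_pgm, USE.NAMES = FALSE)),
       y = factor(sapply(files, label_of, USE.NAMES = FALSE)))
}
```

For example, train <- read_dir("train17") would give a matrix train$x with 4096 columns and a factor train$y with the digits, assuming the images were extracted to a directory called train17.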

Read the data in the train and test files.
Use the train data in a k-NN classifier and measure the error rate: the proportion of the test data that is incorrectly classified by the k-NN. The correct class is given in the file name.
Use k=1,3,5,11,17,21. Which k has the lowest error rate?
Use PCA to reduce the dimensionality significantly, to 100 and to 40 dimensions.
Repeat the experiment above with the different k. Do not forget to transform the test data using the PCA computed on the training data (see below). Was there a change in the best k? And in the error rate?
There is a complication regarding the PCA: there are more dimensions than data points. PCA implementations that are based on the covariance matrix will not work in this case (I don't know why). But most implementations of PCA are based on the SVD decomposition of the data, and these will work. So be sure to use an SVD-based PCA computation (such as prcomp in R).
Repeat the exercise with the files train49.zip for training and test49.zip for testing, which deal with the digits 4 and 9.
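
One possible way to measure the error rate for several values of k is sketched below. It uses knn() from the class package, which is one k-NN implementation among others and is not required by the exercise. The function and argument names (knn_error, knn_errors, train_x, train_y, test_x, test_y) are made up for illustration.

```r
library(class)  # provides knn()

# Error rate for one k: the proportion of test rows whose predicted
# class differs from the true class.
knn_error <- function(train_x, train_y, test_x, test_y, k) {
  pred <- knn(train_x, test_x, cl = train_y, k = k)
  mean(as.character(pred) != as.character(test_y))
}

# Error rates for a whole vector of k values, named by k.
knn_errors <- function(train_x, train_y, test_x, test_y, ks) {
  errs <- sapply(ks, function(k) knn_error(train_x, train_y, test_x, test_y, k))
  names(errs) <- ks
  errs
}
```

Calling knn_errors(train_x, train_y, test_x, test_y, c(1, 3, 5, 11, 17, 21)) returns one error rate per k. The same call can be repeated on the PCA-reduced train and test matrices.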

How to use the PCA in R:
n is the number of dimensions to keep.

    pca <- prcomp(train)
    newtrain <- pca$x[,1:n]
    newtest <- scale(test,pca$center,pca$scale) %*% pca$rotation[,1:n]

The first line computes the PCA. The second returns the train data transformed into the new reduced PCA dimensions - you should have done something similar in the first exercise. The third line uses the training PCA to transform the test data.
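
To see that the third line is the right transformation, it can be applied to the training data itself, where it should reproduce pca$x. The sketch below does this on small random 0/1 data with more columns than rows. The data is synthetic and the sizes (10x30, 4x30, n = 3) are arbitrary choices for illustration, not the exercise's data.

```r
set.seed(1)
# Synthetic 0/1 data with more dimensions (30) than data points (10)
train <- matrix(rbinom(10 * 30, 1, 0.5), nrow = 10)
test  <- matrix(rbinom(4 * 30, 1, 0.5), nrow = 4)
n <- 3  # number of dimensions to keep

pca      <- prcomp(train)  # SVD based, so it runs with more columns than rows
newtrain <- pca$x[, 1:n]
newtest  <- scale(test, pca$center, pca$scale) %*% pca$rotation[, 1:n]

# Applying the test formula to the training data gives back pca$x
check <- scale(train, pca$center, pca$scale) %*% pca$rotation[, 1:n]
same  <- isTRUE(all.equal(unname(check), unname(newtrain)))
```

Here same should be TRUE, and newtest has one row per test image and n columns.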

Last modified: Mon Mar 17 11:28:06 BRT 2014