---
title: "Supervised Learning"
author: "Jonathan Rosenblatt"
date: "April 12, 2015"
output: 
  html_document:
    toc: true
---
In these examples, I will use two data sets from the `ElemStatLearn` package: `spam` for categorical predictions and `prostate` for continuous predictions.
In `spam` we will try to decide if an email is spam or not.
In `prostate` we will try to predict the size of a cancerous tumor (the response `lcavol` is the log cancer volume).

```{r}
source('make_samples.R')
```
You can now call `?prostate` and `?spam` to learn more about these data sets.

We also load some utility packages and functions that we will require down the road. 
```{r preamble}
library(magrittr) # for piping
library(dplyr) # for handling data frames

# My own utility functions:
l2 <- function(x) x^2 %>% sum %>% sqrt 
l1 <- function(x) abs(x) %>% sum  
MSE <- function(x) x^2 %>% mean 
missclassification <- function(tab) sum(tab[c(2,3)])/sum(tab)
```
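
To see what `missclassification` computes, here is a small sketch on a made-up 2-by-2 confusion table (the labels and the `toy.*` names are hypothetical, not taken from `spam`). R stores a table column by column, so cells 2 and 3 of a 2-by-2 table are the off-diagonal cells, i.e. the wrong predictions.

```{r toy misclassification}
# Hypothetical predictions and truths, for illustration only.
toy.truth      <- c('spam','spam','spam','email','email','email','email','email')
toy.prediction <- c('spam','spam','email','email','email','email','spam','spam')
toy.tab <- table(prediction = toy.prediction, truth = toy.truth)
toy.tab

# Same formula as missclassification(): off-diagonal cells over all cells.
toy.rate <- sum(toy.tab[c(2,3)]) / sum(toy.tab)
toy.rate
```

Note that the formula assumes a full 2-by-2 table. If a classifier predicts a single class only, the table has one row, cell 3 does not exist, and the result is `NA`.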

We also initialize the random number generator so that we all get the same results (at least upon a first run):
```{r set seed}
set.seed(2015)
```

# OLS

## OLS Regression

Starting with OLS regression, and a split train-test data set:
```{r OLS Regression}
View(prostate)
# now verify that your data looks as you would expect....

ols.1 <- lm(lcavol~. ,data = prostate.train)
# Train error:
MSE( predict(ols.1)- prostate.train$lcavol)
# Test error:
MSE( predict(ols.1, newdata = prostate.test)- prostate.test$lcavol)
```

Now using cross validation to estimate the prediction error:
```{r Cross Validation}
folds <- 10
fold.assignment <- sample(1:folds, nrow(prostate), replace = TRUE) # one fold label per observation
errors <- NULL

for (k in 1:folds){
  prostate.cross.train <- prostate[fold.assignment!=k,]
  prostate.cross.test <-  prostate[fold.assignment==k,] 
  .ols <- lm(lcavol~. ,data = prostate.cross.train)
  .predictions <- predict(.ols, newdata=prostate.cross.test)
  .errors <-  .predictions - prostate.cross.test$lcavol
  errors <- c(errors, .errors)
}

# Cross validated prediction error:
MSE(errors)
```
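
Assigning fold labels with replacement gives folds of unequal size. A common alternative, sketched here on simulated data (all `toy.*` names are hypothetical), is to shuffle a repeated sequence of fold labels, so that fold sizes differ by at most one.

```{r balanced folds}
# Simulated data standing in for prostate:
toy.n <- 97
toy.folds <- 10
toy.data <- data.frame(x = rnorm(toy.n))
toy.data$y <- 1 + 2 * toy.data$x + rnorm(toy.n)

# Repeat the labels 1..toy.folds up to length toy.n, then shuffle:
toy.assignment <- sample(rep(1:toy.folds, length.out = toy.n))
table(toy.assignment)

toy.errors <- NULL
for (k in 1:toy.folds){
  .fit <- lm(y ~ x, data = toy.data[toy.assignment != k, ])
  .held <- toy.data[toy.assignment == k, ]
  toy.errors <- c(toy.errors, predict(.fit, newdata = .held) - .held$y)
}

# Cross validated prediction error:
mean(toy.errors^2)
```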

Also trying a bootstrap prediction error:
```{r Bootstrap}
B <- 20
n <- nrow(prostate)
errors <- NULL

prostate.boot.test <-  prostate 
for (b in 1:B){
  prostate.boot.train <- prostate[sample(1:n, replace = TRUE),]
  .ols <- lm(lcavol~. ,data = prostate.boot.train)
  .predictions <- predict(.ols, newdata=prostate.boot.test)
  .errors <-  .predictions - prostate.boot.test$lcavol
  errors <- c(errors, .errors)
}

# Bootstrapped prediction error:
MSE(errors)
```
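
The loop above tests each bootstrap fit on the full data, which includes the observations used for fitting, so the error estimate tends to be optimistic. A common variant tests each fit only on the observations that were left out of its bootstrap sample (the "out-of-bag" observations). A minimal sketch on simulated data, with hypothetical `toy.*` names:

```{r OOB bootstrap sketch}
# Simulated data standing in for prostate:
toy.n <- 100
toy.boot <- data.frame(x = rnorm(toy.n))
toy.boot$y <- 1 + 2 * toy.boot$x + rnorm(toy.n)

toy.B <- 20
toy.oob.errors <- NULL
for (b in 1:toy.B){
  .in.bag <- sample(1:toy.n, replace = TRUE)
  .out.of.bag <- setdiff(1:toy.n, .in.bag) # rows never drawn in this sample
  .fit <- lm(y ~ x, data = toy.boot[.in.bag, ])
  .held <- toy.boot[.out.of.bag, ]
  toy.oob.errors <- c(toy.oob.errors, predict(.fit, newdata = .held) - .held$y)
}

# Out-of-bag bootstrapped prediction error:
mean(toy.oob.errors^2)
```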


### OLS Regression Model Selection 


Best subset selection: find the best model of each size:
```{r best subset}
# install.packages('leaps')
library(leaps)

regfit.full <- prostate.train %>% 
  regsubsets(lcavol~.,data = ., method = 'exhaustive')
summary(regfit.full)
plot(regfit.full, scale = "Cp")
```


Train-Validate-Test Model Selection.
Example taken from [here](https://lagunita.stanford.edu/c4x/HumanitiesScience/StatLearning/asset/ch6.html)
```{r OLS TVT model selection}
model.n <- nrow(summary(regfit.full)$which) # number of candidate models (one row per model size)
X.train.named <- prostate.train %>% model.matrix(lcavol ~ ., data = .)  
X.test.named <- prostate.test %>% model.matrix(lcavol ~ ., data = .)  
View(X.test.named)

val.errors <- rep(NA, model.n)
train.errors <- rep(NA, model.n)
for (i in 1:model.n) {
    coefi <- coef(regfit.full, id = i)
    
    pred <-  X.train.named[, names(coefi)] %*% coefi
    train.errors[i] <- MSE(y.train - pred)

    pred <-  X.test.named[, names(coefi)] %*% coefi
    val.errors[i] <- MSE(y.test - pred)
}
plot(train.errors, ylab = "MSE", pch = 19, type = "b", col = "black")
points(val.errors, pch = 19, type = "b", col="blue")

legend("topright", 
       legend = c("Training", "Validation"), 
       col = c("black", "blue"), 
       pch = 19)
```


AIC model selection: 
```{r OLS AIC}
# Forward search:
ols.0 <- lm(lcavol~1 ,data = prostate.train)
model.scope <- list(upper=ols.1, lower=ols.0)
step(ols.0, scope=model.scope, direction='forward', trace = TRUE)

# Backward search:
step(ols.1, scope=model.scope, direction='backward', trace = TRUE)
```


Cross Validated Model Selection.
```{r OLS CV}
# TODO
```
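
A minimal sketch of cross validated model selection, on simulated data rather than `prostate`, and with a hypothetical list of nested candidate formulas: for each candidate, estimate the prediction error by K-fold cross validation, then keep the candidate with the smallest estimate. This is one possible way to go about it, not necessarily the one intended for the chunk above.

```{r OLS CV sketch}
# Simulated data: y depends on x1 and x2, while x3 is pure noise.
toy.n <- 200
toy.cv <- data.frame(x1 = rnorm(toy.n), x2 = rnorm(toy.n), x3 = rnorm(toy.n))
toy.cv$y <- 1 + 2 * toy.cv$x1 - toy.cv$x2 + rnorm(toy.n)

# Hypothetical candidate models, from smallest to largest:
toy.candidates <- list(y ~ 1, y ~ x1, y ~ x1 + x2, y ~ x1 + x2 + x3)

toy.K <- 5
toy.fold <- sample(rep(1:toy.K, length.out = toy.n))

# Cross validated MSE of each candidate, using the same folds for all:
toy.cv.mse <- sapply(toy.candidates, function(form){
  .errors <- NULL
  for (k in 1:toy.K){
    .fit <- lm(form, data = toy.cv[toy.fold != k, ])
    .held <- toy.cv[toy.fold == k, ]
    .errors <- c(.errors, predict(.fit, newdata = .held) - .held$y)
  }
  mean(.errors^2)
})
toy.cv.mse

# Selected model:
toy.candidates[[which.min(toy.cv.mse)]]
```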


Bootstrap model selection:
```{r OLS bootstrap}
# TODO
```


Partial least squares and principal components:
```{r PLS}
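# install.packages('pls')
pls.1 <- pls::plsr(lcavol~., data = prostate.train) # partial least squares
pcr.1 <- pls::pcr(lcavol~., data = prostate.train) # principal components regression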
```

Canonical correlation analysis:
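```{r CCA, eval=FALSE}
# Pointers only (not evaluated): both functions need two sets of variables, x and y.
cancor(x, y)

# Kernel based robust version
kernlab::kcca(x, y)
```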


## OLS Classification
```{r OLS Classification}
# Fit OLS to the dummy coded spam indicator (train and test sets come from make_samples.R):
ols.2 <- lm(spam~., data = spam.train.dummy)

# Train confusion matrix:
.predictions.train <- predict(ols.2) > 0.5
(confusion.train <- table(prediction=.predictions.train, truth=spam.train.dummy$spam))
missclassification(confusion.train)

# Test confusion matrix:
.predictions.test <- predict(ols.2, newdata = spam.test.dummy) > 0.5
(confusion.test <- table(prediction=.predictions.test, truth=spam.test.dummy$spam))
missclassification(confusion.test)
```


# Ridge Regression
```{r Ridge I}
# install.packages('ridge')
library(ridge)

ridge.1 <- linearRidge(lcavol~. ,data = prostate.train)
# Note that if not specified, lambda is chosen automatically by linearRidge.

# Train error:
MSE( predict(ridge.1)- prostate.train$lcavol)
# Test error:
MSE( predict(ridge.1, newdata = prostate.test)- prostate.test$lcavol)
```


Another implementation is `glmnet`, which fits the model along a whole path of $\lambda$ values. Its companion `cv.glmnet` selects the tuning parameter $\lambda$ by cross validation:
```{r Ridge II}
# install.packages('glmnet')
library(glmnet)
ridge.2 <- cv.glmnet(x=X.train, y=y.train, alpha = 0)

# Train error (predict() on a cv.glmnet fit uses s = "lambda.1se" by default):
MSE( predict(ridge.2, newx =X.train)- y.train)

# Test error:
MSE( predict(ridge.2, newx = X.test)- y.test)
```

__Note__:  `glmnet` is slightly picky.
I could not have created `y.train` using `select()` because `glmnet` needs a vector and not a `data.frame`. Likewise, `as.matrix` is used because `glmnet` expects its `x` argument to be of class `matrix`.
These objects are created in the `make_samples.R` script, which we sourced in the beginning. 


# LASSO Regression
```{r LASSO}
# install.packages('glmnet')
library(glmnet)
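# glmnet() alone fits a whole path of lambdas; cv.glmnet() selects one by cross validation,
# so that predict() below returns a single prediction per observation.
lasso.1 <- cv.glmnet(x=X.train, y=y.train, alpha = 1)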

# Train error:
MSE( predict(lasso.1, newx =X.train)- y.train)

# Test error:
MSE( predict(lasso.1, newx = X.test)- y.test)
```


# Logistic Regression For Classification
```{r Logistic Regression}
logistic.1 <- glm(spam~., data = spam.train, family = binomial)
# numerical error. Probably due to too many predictors. 
# Maybe regularizing the logistic regression with Ridge or LASSO will make things better?
```

In the next chunk, we do $l_2$ and $l_1$ regularized logistic regression.
Some technical remarks are in order:

- `glmnet` is picky with its inputs. This has already been discussed in the note following the ridge regression above.
- The `predict` function for `glmnet` objects returns a prediction (see below) for many candidate regularization levels $\lambda$. We thus use `cv.glmnet`, which does an automatic cross validated selection of the best regularization level. 
```{r Regularized Logistic Regression}
library(glmnet)
# Ridge Regularization with CV selection of regularization:
logistic.2 <- cv.glmnet(x=X.train.spam, y=y.train.spam, family = "binomial", alpha = 0)
# LASSO Regularization with CV selection of regularization:
logistic.3 <- cv.glmnet(x=X.train.spam, y=y.train.spam, family = "binomial", alpha = 1)


# Train confusion matrix:
.predictions.train <- predict(logistic.2, newx = X.train.spam, type = 'class') 
(confusion.train <- table(prediction=.predictions.train, truth=spam.train$spam))
missclassification(confusion.train)

.predictions.train <- predict(logistic.3, newx = X.train.spam, type = 'class') 
(confusion.train <- table(prediction=.predictions.train, truth=spam.train$spam))
missclassification(confusion.train)

# Test confusion matrix:
.predictions.test <- predict(logistic.2, newx = X.test.spam, type='class') 
(confusion.test <- table(prediction=.predictions.test, truth=y.test.spam))
missclassification(confusion.test)

.predictions.test <- predict(logistic.3, newx = X.test.spam, type='class') 
(confusion.test <- table(prediction=.predictions.test, truth=y.test.spam))
missclassification(confusion.test)
```


# SVM

## Classification
```{r SVM classification}
library(e1071)
svm.1 <- svm(spam~., data = spam.train)

# Train confusion matrix:
.predictions.train <- predict(svm.1) 
(confusion.train <- table(prediction=.predictions.train, truth=spam.train$spam))
missclassification(confusion.train)

# Test confusion matrix:
.predictions.test <- predict(svm.1, newdata = spam.test) 
(confusion.test <- table(prediction=.predictions.test, truth=spam.test$spam))
missclassification(confusion.test)
```


## Regression
```{r SVM regression}
svm.2 <- svm(lcavol~., data = prostate.train)

# Train error:
MSE( predict(svm.2)- prostate.train$lcavol)
# Test error:
MSE( predict(svm.2, newdata = prostate.test)- prostate.test$lcavol)
```


# GAM Regression
```{r GAM}
# install.packages('mgcv')
library(mgcv)
form.1 <- lcavol~ s(lweight)+ s(age)+s(lbph)+s(svi)+s(lcp)+s(gleason)+s(pgg45)+s(lpsa)
gam.1 <- gam(form.1, data = prostate.train) # the model is too rich. let's select a variable subset

ridge.1 %>% coef %>% abs %>% sort(decreasing = TRUE) # select the most promising coefficients (a very arbitrary practice)
form.2 <- lcavol~  s(lweight)+ s(age)+s(lbph)+s(lcp)+s(pgg45)+s(lpsa) # keep only promising coefficients in model
gam.2 <- gam(form.2, data = prostate.train)

# Train error:
MSE( predict(gam.2)- prostate.train$lcavol)
# Test error:
MSE( predict(gam.2, newdata = prostate.test)- prostate.test$lcavol)
```


# Neural Net

## Regression
```{r NNET regression}
library(nnet)
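# linout = TRUE asks for a linear output unit; the default logistic output is confined to (0,1).
nnet.1 <- nnet(lcavol~., size=20, data=prostate.train, rang = 0.1, decay = 5e-4, maxit = 1000, linout = TRUE)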

# Train error:
MSE( predict(nnet.1)- prostate.train$lcavol)
# Test error:
MSE( predict(nnet.1, newdata = prostate.test)- prostate.test$lcavol)
```


Let's automate the network size selection:
```{r NNET validate}
validate.nnet <- function(size){
  # linout = TRUE: linear output unit, as required for a continuous response.
  .nnet <- nnet(lcavol~., size=size, data=prostate.train, rang = 0.1, decay = 5e-4, maxit = 200, linout = TRUE)
  .train <- MSE( predict(.nnet)- prostate.train$lcavol)
  .test <- MSE( predict(.nnet, newdata = prostate.test)- prostate.test$lcavol)
  return(list(train=.train, test=.test))
}

validate.nnet(3)
validate.nnet(4)
validate.nnet(20)
validate.nnet(50)

sizes <- seq(2, 30)
validate.sizes <- rep(NA, length(sizes))
for (i in seq_along(sizes)){
  validate.sizes[i] <- validate.nnet(sizes[i])$test
}
plot(validate.sizes~sizes, type='l')
```
Do not expect a smooth curve here: `nnet` starts from random initial weights, so the validation error fluctuates from fit to fit.


## Classification
```{r NNET Classification}
nnet.2 <- nnet(spam~., size=5, data=spam.train, rang = 0.1, decay = 5e-4, maxit = 1000)

# Train confusion matrix:
.predictions.train <- predict(nnet.2, type='class') 
(confusion.train <- table(prediction=.predictions.train, truth=spam.train$spam))
missclassification(confusion.train)

# Test confusion matrix:
.predictions.test <- predict(nnet.2, newdata = spam.test, type='class') 
(confusion.test <- table(prediction=.predictions.test, truth=spam.test$spam))
missclassification(confusion.test)
```


# CART


## Regression
```{r Tree regression}
library(rpart)
tree.1 <- rpart(lcavol~., data=prostate.train)

# Train error:
MSE( predict(tree.1)- prostate.train$lcavol)
# Test error:
MSE( predict(tree.1, newdata = prostate.test)- prostate.test$lcavol)
```

At this stage we should prune the tree using `prune()`...
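
A minimal sketch of the pruning step, assuming the usual recipe of picking the complexity parameter `cp` with the smallest cross validated error (`xerror`) in the tree's `cptable`. It uses the built-in `mtcars` data rather than `prostate`, and all `toy.*` names are hypothetical.

```{r prune sketch}
library(rpart)

# Deliberately overgrown tree (cp = 0 switches off the default stopping penalty):
toy.tree <- rpart(mpg ~ ., data = mtcars,
                  control = rpart.control(cp = 0, minsplit = 5))

# cptable holds the cross validated error (xerror) of each candidate cp:
toy.cptable <- toy.tree$cptable
toy.best.cp <- toy.cptable[which.min(toy.cptable[, 'xerror']), 'CP']

toy.pruned <- prune(toy.tree, cp = toy.best.cp)

# Number of leaves before and after pruning:
c(full = sum(toy.tree$frame$var == '<leaf>'),
  pruned = sum(toy.pruned$frame$var == '<leaf>'))
```

The same two steps would apply to `tree.1`: read `tree.1$cptable`, then call `prune(tree.1, cp = ...)`.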

## Classification
```{r Tree classification}
tree.2 <- rpart(spam~., data=spam.train)

# Train confusion matrix:
.predictions.train <- predict(tree.2, type='class') 
(confusion.train <- table(prediction=.predictions.train, truth=spam.train$spam))
missclassification(confusion.train)

# Test confusion matrix:
.predictions.test <- predict(tree.2, newdata = spam.test, type='class') 
(confusion.test <- table(prediction=.predictions.test, truth=spam.test$spam))
missclassification(confusion.test)
```


# Random Forest
TODO

# Rotation Forest
TODO


# Smoothing Splines
I will demonstrate the method with a single predictor, so that we can visualize the smoothing that has been performed:

```{r Smoothing Splines}
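# A single predictor: lpsa is an arbitrary choice for illustration.
spline.1 <- smooth.spline(x=prostate.train$lpsa, y=prostate.train$lcavol)

# Visualize the non linear hypothesis we have learned:
plot(lcavol~lpsa, data=prostate.train, col='red', type='h')
points(spline.1, type='l')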
```
I am not extracting train and test errors as the output of `smooth.spline` will require some tweaking for that.
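
For reference, `predict()` on a `smooth.spline` fit returns a list with components `x` and `y`, so the errors can be computed from the `y` component. A minimal sketch on simulated data (the `toy.*` names and the train-test split are assumptions, not objects from `make_samples.R`):

```{r spline errors sketch}
# Simulated single predictor and response:
toy.x <- runif(150, 0, 10)
toy.y <- sin(toy.x) + rnorm(150, sd = 0.3)
toy.train <- 1:100
toy.test <- 101:150

toy.spline <- smooth.spline(x = toy.x[toy.train], y = toy.y[toy.train])

# predict() returns list(x, y); the predicted values are in $y.
toy.train.mse <- mean((predict(toy.spline, x = toy.x[toy.train])$y - toy.y[toy.train])^2)
toy.test.mse  <- mean((predict(toy.spline, x = toy.x[toy.test])$y  - toy.y[toy.test])^2)
c(train = toy.train.mse, test = toy.test.mse)
```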


# KNN 

## Classification
```{r knn classification}
library(class)
knn.1 <- knn(train = X.train.spam, test = X.test.spam, cl =y.train.spam, k = 1)

# Test confusion matrix:
.predictions.test <- knn.1 
(confusion.test <- table(prediction=.predictions.test, truth=spam.test$spam))
missclassification(confusion.test)
```

And now we would try to optimize `k` by trying different values.
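
A minimal sketch of such a search, using the built-in `iris` data as a stand-in for `spam` (the split and all `toy.*` names are hypothetical). The candidate values of `k` are compared on a held-out validation set; using the test set for this would bias the final test error.

```{r knn k sketch}
library(class)

# Arbitrary train-validation split of iris:
toy.idx <- sample(nrow(iris), 100)
toy.X.train <- as.matrix(iris[toy.idx, 1:4])
toy.X.valid <- as.matrix(iris[-toy.idx, 1:4])
toy.cl.train <- iris$Species[toy.idx]
toy.cl.valid <- iris$Species[-toy.idx]

# Validation misclassification rate for each candidate k:
toy.ks <- 1:15
toy.knn.error <- sapply(toy.ks, function(k){
  .pred <- knn(train = toy.X.train, test = toy.X.valid, cl = toy.cl.train, k = k)
  mean(.pred != toy.cl.valid)
})
plot(toy.knn.error ~ toy.ks, type = 'b')

# The k with the smallest validation error:
toy.ks[which.min(toy.knn.error)]
```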


# Kernel Regression
Kernel regression includes many particular algorithms. 
```{r kernel}
# install.packages('np')
library(np)
ksmooth.1 <- npreg(txdat =X.train, tydat = y.train)

# Train error:
MSE( predict(ksmooth.1)- prostate.train$lcavol)
```

Predictions on test data are not computed here; `npreg` accepts evaluation data through its `exdat` argument.


# Stacking
As seen in the class notes, there are many ensemble methods.
Stacking, in my view, is by far the most useful and coolest. It is thus the only one I present here.

The following example is adapted from [James E. Yonamine](http://jayyonamine.com/?p=456).

```{r Stacking}
#####step 1: train models ####
#logits
logistic.2 <- cv.glmnet(x=X.train.spam, y=y.train.spam, family = "binomial", alpha = 0)
logistic.3 <- cv.glmnet(x=X.train.spam, y=y.train.spam, family = "binomial", alpha = 1)
 

# Learning Vector Quantization (LVQ)
my.codebook<-lvqinit(x=X.train.spam, cl=y.train.spam, size=10, prior=c(0.5,0.5),k = 2)
my.codebook<-lvq1(x=X.train.spam, cl=y.train.spam, codebk=my.codebook, niter = 100 * nrow(my.codebook$x), alpha = 0.03)
 
# SVM
library('e1071')
svm.fit <- svm(y=y.train.spam, x=X.train.spam, probability=TRUE)


#####step 2a: build predictions for data.train####
train.predict<- cbind(
  predict(logistic.2, newx=X.train.spam, type="response"),
  predict(logistic.3, newx=X.train.spam, type="response"),
  knn1(train=my.codebook$x, test=X.train.spam, cl=my.codebook$cl),
  predict(svm.fit, X.train.spam, probability=TRUE)
)

####step 2b: build predictions for data.test####
test.predict <- cbind(
  predict(logistic.2, newx=X.test.spam, type="response"),
  predict(logistic.3, newx=X.test.spam, type="response"),
  # Columns must be in the same order as in train.predict:
  knn1(train=my.codebook$x, test=X.test.spam, cl=my.codebook$cl),
  predict(svm.fit, newdata = X.test.spam, probability = TRUE)
)

 
####step 3: train SVM on train.predict####
final <- svm(y=y.train.spam, x=train.predict, probability=TRUE)

####step 4: use trained SVM to make predictions with test.predict####
final.predict <- predict(final, test.predict, probability=TRUE)
results<-as.matrix(final.predict)
table(results, y.test.spam)
```


# Fisher's LDA
```{r LDA}
library(MASS) 
lda.1 <- lda(spam~., spam.train)

# Train confusion matrix:
.predictions.train <- predict(lda.1)$class
(confusion.train <- table(prediction=.predictions.train, truth=spam.train$spam))
missclassification(confusion.train)

# Test confusion matrix:
.predictions.test <- predict(lda.1, newdata = spam.test)$class
(confusion.test <- table(prediction=.predictions.test, truth=spam.test$spam))
missclassification(confusion.test)
```

__Caution__:
Both `MASS` and `dplyr` have a function called `select`. I will thus try to avoid having the two packages loaded at once, or call the function by its full name: `MASS::select` or `dplyr::select`.


# Naive Bayes
```{r Naive Bayes}
library(e1071)
nb.1 <- naiveBayes(spam~., data = spam.train)

# Train confusion matrix:
.predictions.train <- predict(nb.1, newdata = spam.train)
(confusion.train <- table(prediction=.predictions.train, truth=spam.train$spam))
missclassification(confusion.train)

# Test confusion matrix:
.predictions.test <- predict(nb.1, newdata = spam.test)
(confusion.test <- table(prediction=.predictions.test, truth=spam.test$spam))
missclassification(confusion.test)
```