05-data-splitting.Rmd

# Splitting the data

------------------------------------------------------------------------

At this point we can split the data into calibration, validation, and predition sets:

* The calibration set comprises the samples identified in the previous section. The IDs of these samples are in the object `cal_smpls`. 

* The validation set are the ones in the original set and that are labeled as `validation`. In the previous section the validation samples where extracted into a separate object (`valida`). 

* The prediction set includes all the samples not selected for calibration and that were initially labeled as `cal_candidate`. 

To split the data you can execute the following:

```{r eval = FALSE}
train <- data[as.character(data$ID) %in%  cal_smpls, ]
pred <- data[!(as.character(data$ID) %in%  c(cal_smpls)), ]

train$layer <- as.factor(substr(train$ID, 1, 1))
valida$layer <- as.factor(substr(valida$ID, 1, 1))
pred$layer <- as.factor(substr(pred$ID, 1, 1))

train$ID <- factor(train$ID)
valida$ID <- factor(valida$ID)
pred$ID <- factor(pred$ID)
```

Optionally, we can get rid of all the unncessary data (`R` objects that will not be used from now on):
```{r eval = FALSE}
## necessary objects
reqobjects <- c("train", "pred", "valida", "cal_smpls", "o2rm")

## objects to be removed
o2rm <- ls()[!ls() %in% reqobjects]

## remove the objects
rm(list = o2rm)
```

Alternatively...
```{r eval = FALSE}
## If you have saved the IDs of the calibration samples 
## into your working directory you can:
cal_smpls <- readLines("calibration_samples_ids.txt")
```

and then...
```{r eval = FALSE}
## necessary objects
reqobjects <- c("cal_smpls", "o2rm")

## objects to be removed
o2rm <- ls()[!ls() %in% reqobjects]

## read again the data
nirfile <- file("https://github.com/l-ramirez-lopez/VNIR_spectroscopy_for_robust_soil_mapping/raw/master/SoilNIRSaoPaulo.rds")
data <- readRDS(nirfile)

## extract the validation samples into a new set/object
valida <- data[data$set == "validation",]
data <- data[data$set == "cal_candidate",]

train <- data[as.character(data$ID) %in%  cal_smpls, ]
pred <- data[!(as.character(data$ID) %in%  c(cal_smpls)), ]

train$layer <- as.factor(substr(train$ID, 1, 1))
valida$layer <- as.factor(substr(valida$ID, 1, 1))
pred$layer <- as.factor(substr(pred$ID, 1, 1))

train$ID <- factor(train$ID)
valida$ID <- factor(valida$ID)
pred$ID <- factor(pred$ID)
```