# Splitting the data
------------------------------------------------------------------------
At this point we can split the data into calibration, validation, and prediction sets:
* The calibration set comprises the samples identified in the previous section. The IDs of these samples are in the object `cal_smpls`.
* The validation set comprises the samples in the original set that are labeled as `validation`. In the previous section, these samples were extracted into a separate object (`valida`).
* The prediction set includes all the samples that were initially labeled as `cal_candidate` and were not selected for calibration.
To split the data you can execute the following:
```{r eval = FALSE}
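## calibration set: samples whose IDs are in cal_smpls
train <- data[as.character(data$ID) %in% cal_smpls, ]
## prediction set: the remaining samples
pred <- data[!(as.character(data$ID) %in% cal_smpls), ]
## store the first character of each ID as the factor `layer`
train$layer <- as.factor(substr(train$ID, 1, 1))
valida$layer <- as.factor(substr(valida$ID, 1, 1))
pred$layer <- as.factor(substr(pred$ID, 1, 1))
## re-encode the IDs as factors (this drops any unused levels)
train$ID <- factor(train$ID)
valida$ID <- factor(valida$ID)
pred$ID <- factor(pred$ID)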
```
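A quick way to check that this kind of split behaves as intended is to run the same indexing on a small toy data frame. The following is only a sketch: `toy`, `toy_cal`, and the IDs in them are made up for illustration and stand in for `data` and `cal_smpls`.

```{r eval = FALSE}
## hypothetical toy data standing in for `data` and `cal_smpls`
toy <- data.frame(
  ID = factor(c("A01", "A02", "B01", "B02", "B03")),
  value = c(0.1, 0.4, 0.3, 0.9, 0.5)
)
toy_cal <- c("A02", "B01")

## same indexing logic as in the split above
toy_train <- toy[as.character(toy$ID) %in% toy_cal, ]
toy_pred <- toy[!(as.character(toy$ID) %in% toy_cal), ]

## the two sets must not share IDs, must cover all rows,
## and every requested ID must be found in the calibration set
stopifnot(
  length(intersect(as.character(toy_train$ID), as.character(toy_pred$ID))) == 0,
  nrow(toy_train) + nrow(toy_pred) == nrow(toy),
  all(toy_cal %in% as.character(toy_train$ID))
)
```

The same three checks can be run on `train`, `pred`, and `cal_smpls` after the real split; the last one is useful for spotting IDs that do not match any sample.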
Optionally, we can remove all the unnecessary data (`R` objects that will not be used from now on):
```{r eval = FALSE}
## necessary objects
reqobjects <- c("train", "pred", "valida", "cal_smpls", "o2rm")
## objects to be removed
o2rm <- ls()[!ls() %in% reqobjects]
## remove the objects
rm(list = o2rm)
```
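If you would like to see what this clean-up pattern does before running it on your own workspace, you can try it inside a throwaway environment. This is a minimal sketch with made-up object names (`tmp_a`, `tmp_b`) and a hypothetical environment `e`; it uses `setdiff()` as an equivalent way to write the "everything except the required objects" selection.

```{r eval = FALSE}
## a separate environment, so that the global workspace is not touched
e <- new.env()
assign("train", 1, envir = e)
assign("pred", 2, envir = e)
assign("tmp_a", 3, envir = e)
assign("tmp_b", 4, envir = e)

## objects to keep (hypothetical names, for illustration only)
keep <- c("train", "pred")

## everything else in the environment is removed
drop <- setdiff(ls(envir = e), keep)
rm(list = drop, envir = e)

## list what is left
ls(envir = e)
```

Only the objects named in `keep` should remain in `e` after the call to `rm()`.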
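Alternatively, if you have saved the IDs of the calibration samples to a file, you can rebuild the three sets from it: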
```{r eval = FALSE}
## If you have saved the IDs of the calibration samples
## into your working directory you can:
cal_smpls <- readLines("calibration_samples_ids.txt")
```
Then read the data again and repeat the split:
```{r eval = FALSE}
## necessary objects
reqobjects <- c("cal_smpls", "o2rm")
## objects to be removed
o2rm <- ls()[!ls() %in% reqobjects]
## remove the objects
rm(list = o2rm)
## read again the data
nirfile <- file("https://github.com/l-ramirez-lopez/VNIR_spectroscopy_for_robust_soil_mapping/raw/master/SoilNIRSaoPaulo.rds")
data <- readRDS(nirfile)
## extract the validation samples into a new set/object
valida <- data[data$set == "validation",]
data <- data[data$set == "cal_candidate",]
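## calibration set: candidate samples whose IDs are in cal_smpls
train <- data[as.character(data$ID) %in% cal_smpls, ]
## prediction set: the remaining candidate samples
pred <- data[!(as.character(data$ID) %in% cal_smpls), ]
## store the first character of each ID as the factor `layer`
train$layer <- as.factor(substr(train$ID, 1, 1))
valida$layer <- as.factor(substr(valida$ID, 1, 1))
pred$layer <- as.factor(substr(pred$ID, 1, 1))
## re-encode the IDs as factors (this drops any unused levels)
train$ID <- factor(train$ID)
valida$ID <- factor(valida$ID)
pred$ID <- factor(pred$ID)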
```
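The alternative route relies on the ID file holding one ID per line, which is how `readLines()` returns them. The sketch below illustrates that round trip on made-up IDs. It assumes the file was written with `writeLines()` (the text above does not say how it was saved), and it uses a temporary file and the hypothetical names `toy_ids`, `id_file`, and `toy` rather than the real data.

```{r eval = FALSE}
## made-up IDs and a temporary file (both hypothetical)
toy_ids <- c("A02", "B01")
id_file <- tempfile(fileext = ".txt")

## assumption: the IDs were saved one per line
writeLines(toy_ids, id_file)
ids_back <- readLines(id_file)

## the IDs read back must be identical to the ones written
stopifnot(identical(ids_back, toy_ids))

## readLines() returns a character vector, which is why the IDs
## in the data are converted with as.character() before matching
toy <- data.frame(
  ID = factor(c("A01", "A02", "B01", "B02")),
  value = c(0.2, 0.5, 0.1, 0.7)
)
toy_train <- toy[as.character(toy$ID) %in% ids_back, ]
toy_train$layer <- as.factor(substr(toy_train$ID, 1, 1))
toy_train$ID <- factor(toy_train$ID)

## inspect the resulting factor levels
levels(toy_train$layer)
levels(toy_train$ID)

## clean up the temporary file
unlink(id_file)
```

Because `factor()` is applied after subsetting, the levels of `toy_train$ID` contain only the IDs that were actually selected.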