---
title: "Analyzing Massive Data Sets"
author: "Jonathan Rosenblatt"
date: "23/04/2015"
output:
  html_document:
    toc: true
---
```{r setup, include=FALSE}
library(knitr)
opts_chunk$set(cache=TRUE)
```
# Introduction
When analyzing data, you may encounter several resource constraints:
- Hard-disk space: your data might not fit on your HD. This matter is not discussed in this text.
- RAM constraint: your data fits on the HD, but the implementation you are using of your favorite method needs more RAM than you have. This is the main topic of this text, in which we demonstrate out-of-memory implementations of many popular algorithms.
- CPU constraint: your algorithm has all the memory it needs; it simply runs too slowly. Parallelizing the computation over more cores in your machine, or over more machines, is in order.
## Diagnostics
In order to diagnose the resource limit you are encountering, always work with your Task Manager (Windows) or `top` (Linux) open. The cases where you get error messages from your software are easy to diagnose. In other cases, where computations never end but no errors are thrown, check which resource is running low in your task manager.
## Terminology
- In-memory: processing loads the required data into RAM.
- Out-of-memory: processing is not done from RAM, but rather from the HD.
- Batch algorithm: loads all the data when processing.
- Streaming algorithm: the algorithm progresses by processing a single observation at a time (see the toy sketch after this list).
- Mini-batch algorithm: mid-way between batch and streaming.
- Swap file: a file on the HD which mimics RAM.
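To make the batch/streaming distinction concrete, here is a toy sketch of my own (not an out-of-memory implementation): the streaming version keeps only a running sum and a counter in memory, processing one observation at a time.
```{r streaming_toy}
x <- rnorm(1e5)

# Batch: all observations are held in RAM and processed at once.
mean.batch <- mean(x)

# Streaming: process one observation at a time, keeping only a running state.
# (Here x is already in RAM; in practice the observations would arrive from a file or connection.)
running.sum <- 0
n <- 0
for (xi in x) {
  running.sum <- running.sum + xi
  n <- n + 1
}
mean.stream <- running.sum / n

all.equal(mean.batch, mean.stream)
```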
## Tips and Tricks
1. For *batch* algorithms, memory usage should not exceed roughly $30\%$ of your available RAM.
2. Swap files: NEVER rely on a swap file.
3. R releases memory only when needed, not when possible ("lazy" release). See the sketch after this list.
4. Don't count on R returning RAM to the operating system (at least on Linux). Restart R if Facebook (i.e., your browser) slows down.
5. When you want to go pro, read [Hadley's memory usage guide](http://adv-r.had.co.nz/memory.html).
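Here is a minimal sketch of inspecting memory from within R: `gc()` triggers a garbage collection and reports usage, and `pryr::mem_used()` reports the total memory used by R objects.
```{r memory_inspection}
gc()               # force a garbage collection and report memory usage
pryr::mem_used()   # total memory allocated to R objects

# Removing large objects and collecting garbage is the closest you get
# to "manually" releasing RAM from within R:
big <- rnorm(1e7)
rm(big)
gc()
```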
## Bla bla... Let's see some code!
Inspiration from [here](http://www.r-bloggers.com/bigglm-on-your-big-data-set-in-open-source-r-it-just-works-similar-as-in-sas/).
Download a fat data file:
```{r download_data}
# download.file("http://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/BSAPUFS/Downloads/2010_Carrier_PUF.zip", "2010_Carrier_PUF.zip")
# unzip(zipfile="2010_Carrier_PUF.zip")
```
The `data.table` package's `fread` is much more efficient than the `read.table`/`read.csv` family of functions.
You should also consider the `readr` [package](https://github.com/hadley/readr), which we did not document here (yet); see the sketch after the next chunk.
```{r import_data}
# install.packages('data.table')
library(data.table)
data <- data.table::fread(input = "2010_BSA_Carrier_PUF.csv",
                          sep = ',',
                          header = TRUE)
# For comparison, the base-R equivalent is much slower:
# read.csv("2010_BSA_Carrier_PUF.csv")
library(magrittr) # for piping syntax
.names <- c("sex", "age", "diagnose", "healthcare.procedure", "typeofservice", "service.count", "provider.type", "servicesprocessed", "place.served", "payment", "carrierline.count")
data %>% setnames(.names)
```
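As mentioned above, `readr` is an alternative importer we do not document in detail; a minimal sketch, assuming the same file and the column names defined above:
```{r readr_alternative}
# install.packages('readr')
library(readr)
data.readr <- read_csv("2010_BSA_Carrier_PUF.csv")
names(data.readr) <- .names  # reuse the column names defined above
```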
Now verify the size of your data in memory:
```{r}
object.size(data)
# But I prefer pryr:
pryr::object_size(data)
```
When does R create a copy of an object? Use `tracemem`:
```{r tracemem}
tracemem(data)
.test <- glm(payment ~ sex + age + place.served, data = data[1:1e2,], family=poisson)
```
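A small illustration of my own, using only base R, of how `tracemem` reports copies: plain assignment does not copy, but modifying the new binding triggers R's copy-on-modify, which `tracemem` prints.
```{r tracemem_toy}
x <- runif(1e6)
tracemem(x)     # start tracing copies of x

y <- x          # no copy yet: x and y point to the same memory
y[1] <- 0       # copy-on-modify: tracemem reports a duplication here

untracemem(x)   # stop tracing
```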
Profile each line of code for time and memory usage using [lineprof](https://github.com/hadley/lineprof)
```{r lineprof}
# devtools::install_github("hadley/lineprof")
prof <- lineprof::lineprof(
glm(payment ~ sex + age + place.served, data = data)
)
lineprof::shine(prof)
```
But actually, I just like to have my Task-Manager constantly open:
```{r inspect_RAM}
# Run and inspect RAM/CPU
glm(payment ~ sex + age + place.served, data = data, family=poisson)
```
Now let's artificially scale the problem.
Note: `copies` is small so that fitting can be done in real-time.
To demonstrate the problem, I would have set `copies <- 10`.
```{r artificial_scale}
copies <- 2
data.2 <- do.call(rbind, lapply(1:copies, function(x) data) )
system.time(data.2 %>% dim)
pryr::object_size(data)
pryr::object_size(data.2)
```
When you run the following code at home, it will *not* throw an out-of-memory error, but it will take a long time to run and a long time to release memory when stopped.
It is thus a *memory* constraint, not a CPU one.
```{r}
## Don't run:
## glm.2 <-glm(payment ~ sex + age + place.served, data = data.2, family=poisson)
```
Since this data still fits easily in RAM, the problem can be solved by a *streaming* algorithm.
The following object, however, cannot even be stored in RAM, so streaming *from RAM* will not solve the problem.
We will get back to this...
```{r}
## Don't run:
## copies <- 1e2
## data.3 <- do.call(rbind, lapply(1:copies, function(x) data) )
```
# Streaming Regression
We now explore several R implementations of streaming algorithms, which overcome RAM constraints at a moderate CPU cost.
## biglm
```{r biglm}
# install.packages('biglm')
library(biglm)
mymodel <- biglm::bigglm(payment ~ sex + age + place.served,
                         data = data.2,
                         family = poisson(),
                         maxit = 1e3)
# Too long! Quit the job and time the release.
# For demonstration: OLS example with original data.
mymodel <- bigglm(payment ~ sex + age + place.served, data = data)
mymodel <- data %>% bigglm(payment ~ sex + age + place.served, data = .)
```
Remarks:
- R is immediately(!) available after quitting the job.
- `bigglm` objects behave (almost) like `glm` objects w.r.t. `coef`, `summary`, ... (see the sketch after this list).
- `bigglm` is aimed at *memory* constraints, not speed.
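For instance, the usual extractors work on the fit from the previous chunk:
```{r bigglm_extractors}
coef(mymodel)      # estimated coefficients, as for a glm object
summary(mymodel)   # coefficient table with standard errors
```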
## Exploit sparsity in your data
Very relevant to factors with many levels.
```{r}
reps <- 1e6
y <- rnorm(reps)
x <- letters %>%
  sample(reps, replace = TRUE) %>%
  factor
X.1 <- model.matrix(~ x - 1)  # Make dummy variable matrix
library(MatrixModels)
X.2 <- as(x, "sparseMatrix") %>% t  # Makes sparse dummy matrix
dim(X.1)
dim(X.2)
pryr::object_size(X.1)
pryr::object_size(X.2)
```
```{r}
system.time(lm.1 <- lm(y ~ X.1))
system.time(lm.1 <- lm.fit(y=y, x=X.1))
system.time(lm.2 <- MatrixModels:::lm.fit.sparse(X.2,y))
all.equal(lm.2, unname(lm.1$coefficients), tolerance = 1e-12)
```
# Streaming classification
[LiblineaR](http://cran.r-project.org/web/packages/LiblineaR/index.html) and [RSofia](http://cran.r-project.org/web/packages/RSofia/index.html) will stream your data from RAM for classification problems, mainly SVMs (see the sketch below).
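A minimal sketch of fitting a linear SVM with `LiblineaR` on simulated data; the interface below is from my recollection of the package, so check `?LiblineaR` before relying on the exact arguments.
```{r liblinear_sketch}
# install.packages('LiblineaR')
library(LiblineaR)

n <- 1e5; p <- 10
X <- matrix(rnorm(n * p), n, p)
y <- sign(X[, 1] + rnorm(n))    # labels in {-1, 1}, driven mostly by the first feature

fit <- LiblineaR(data = X, target = y,
                 type = 2,      # an L2-regularized SVM formulation; see ?LiblineaR for all types
                 cost = 1)
pred <- predict(fit, newx = X)$predictions
mean(pred == y)                 # training accuracy
```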
# Out of memory Regression
What if it is not the __algorithm__ that causes the problem, but merely __importing__ my objects?
## ff
The `ff` package replaces R's in-RAM storage mechanism with (efficient) on-disk storage.
First open a connection to the file, without actually importing it.
```{r}
# install.packages('LaF')
library(LaF)
.dat <- laf_open_csv(filename = "2010_BSA_Carrier_PUF.csv",
                     column_types = c("integer", "integer", "categorical", "categorical", "categorical", "integer", "integer", "categorical", "integer", "integer", "integer"),
                     column_names = c("sex", "age", "diagnose", "healthcare.procedure", "typeofservice", "service.count", "provider.type", "servicesprocessed", "place.served", "payment", "carrierline.count"),
                     skip = 1)
```
Now write the data to HD as an ff object:
```{r}
# install.packages('ffbase')
library(ffbase)
data.ffdf <- laf_to_ffdf(laf = .dat)
```
Notice the minimal RAM allocation:
```{r}
pryr::object_size(data)
pryr::object_size(data.ffdf)
```
Caution: `base` functions are unaware of `ff` objects.
Adapted algorithms are required (see also the chunk-wise sketch below)...
```{r}
data$age %>% table
ffbase:::table.ff(data.ffdf$age)
```
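More generally, `ff` workflows process the data chunk by chunk. A minimal sketch of my own, assuming the `chunk()` generic that `ff` supplies for its objects: each chunk is pulled into RAM as an ordinary data.frame, summarized, and the partial results are combined.
```{r ff_chunkwise}
# Chunk-wise mean of the payment column: only one chunk is in RAM at a time.
partial.sum <- 0
partial.n <- 0
for (i in chunk(data.ffdf)) {
  d <- data.ffdf[i, ]           # this chunk, as an ordinary data.frame
  partial.sum <- partial.sum + sum(d$payment)
  partial.n <- partial.n + nrow(d)
}
partial.sum / partial.n         # overall mean of payment
```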
Luckily, `bigglm` has its `ff` version:
```{r biglm_regression}
mymodel.ffdf.2 <- bigglm.ffdf(payment ~ sex + age + place.served,
                              data = data.ffdf,
                              family = poisson(),
                              maxit = 1e3)
# Again, too slow. Stop and run:
mymodel.ffdf.2 <- bigglm.ffdf(payment ~ sex + age + place.served,
                              data = data.ffdf)
```
The previous approach scales to any file I can store on disk (though it might take a while).
I will now inflate the data to a size that would not fit in RAM.
```{r}
copies <- 2e1
data.2.ffdf <- do.call(rbind, lapply(1:copies, function(x) data.ffdf) )
# Actual size on disk: .rambytes gives bytes per element for each storage mode,
# and the constant 9.31322575e-10 converts bytes to gigabytes (1/2^30).
cat('Size in GB ', sum(.rambytes[vmode(data.2.ffdf)]) * (nrow(data.2.ffdf) * 9.31322575 * 10^(-10)))
# In memory:
pryr::object_size(data.2.ffdf)
```
And now I can run this MASSIVE regression:
```{r biglm_ffdf_regression}
## Do not run:
# mymodel.ffdf.2 <- bigglm.ffdf(payment ~ sex + age + place.served,
# data = data.2.ffdf,
# family = poisson(),
# maxit=1e3)
```
Notes:
- Notice again the quick release of memory when aborting the process.
- Solving RAM constraints does not guarantee speed. This particular problem is actually worth parallelizing.
- SAS, SPSS, Revolution R, ... all rely on similar ideas.
- Clearly, with so few variables I would be better off *subsampling* (see the sketch after this list).
- The [SOAR](http://cran.r-project.org/web/packages/SOAR/index.html) package also allows similar out-of-memory processing.
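A minimal subsampling sketch: fit the same model on a random subset of rows that comfortably fits in RAM, which is often statistically adequate when only a handful of coefficients are being estimated.
```{r subsampling_sketch}
set.seed(1)
n.sub <- 1e5                                  # a subsample size that easily fits in RAM
idx <- sample(nrow(data.2), n.sub)
glm.sub <- glm(payment ~ sex + age + place.served,
               data = data.2[idx, ],
               family = poisson)
coef(glm.sub)
```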
# Out of memory Classification
I do not know if there are `ff` versions of `LiblineaR` or `RSofia`.
If you find out, let me know.
# Parallelization
## Parallelized learning
[TODO]
## Parallelized simulation
[TODO]
## Distributed Graph algorithms
[TODO]