---
title: "Analyzing Massive Data Sets"
author: "Jonathan Rosenblatt"
date: "23/04/2015"
output:
  html_document:
    toc: true
---
```{r setup, include=FALSE}
library(knitr)
opts_chunk$set(cache=TRUE)
```
# Introduction
When analyzing data, you may encounter several resource constraints:
- Hard-disk space: your data might not fit on your HD. This matter is not discussed in this text.
- RAM constraint: your data fits on the HD, but the implementation you are using of your favorite method needs more RAM than you have. This is the main topic of this text, in which we demonstrate out-of-memory implementations of many popular algorithms.
- CPU constraint: your algorithm has all the memory it needs; it simply runs too slowly. Parallelizing the computation over more cores in your machine, or over more machines, is in order.
## Diagnostics
In order to diagnose the resource limit you are encountering, always work with your Task Manager (Windows) or `top` (Linux) open. The cases where you get error messages from your software are easy to diagnose. In other cases, where computations never end but no errors are thrown, check which resource is running low in your task manager.
## Terminology
- In-memory: processing loads the required data into RAM.
- Out-of-memory: processing is not done from RAM, but rather from the HD.
- Batch algorithm: loads all the data when processing.
- Streaming algorithm: the algorithm progresses by processing a single observation at a time (see the toy sketch after this list).
- Mini-batch algorithm: mid-way between batch and streaming.
- Swap file: a file on the HD which mimics RAM.
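To make the batch/streaming distinction concrete, here is a toy sketch of my own (not an out-of-memory implementation): the streaming version keeps only a running sum and a counter in memory, processing one observation at a time.
```{r streaming_toy}
x <- rnorm(1e5)

# Batch: all observations are held in RAM and processed at once.
mean.batch <- mean(x)

# Streaming: process one observation at a time, keeping only a running state.
# (Here x is already in RAM; in practice the observations would arrive from a file or connection.)
running.sum <- 0
n <- 0
for (xi in x) {
  running.sum <- running.sum + xi
  n <- n + 1
}
mean.stream <- running.sum / n

all.equal(mean.batch, mean.stream)
```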
## Tips and Tricks
1. For *batch* algorithms, memory usage should not exceed roughly $30\%$ of your available RAM.
2. Swap files: NEVER rely on a swap file.
3. R releases memory only when needed, not when possible ("lazy" release). See the sketch after this list.
4. Don't count on R returning RAM to the operating system (at least on Linux). Restart R if Facebook (i.e., your browser) slows down.
5. When you want to go pro, read [Hadley's memory usage guide](http://adv-r.had.co.nz/memory.html).
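Here is a minimal sketch of inspecting memory from within R: `gc()` triggers a garbage collection and reports usage, and `pryr::mem_used()` reports the total memory used by R objects.
```{r memory_inspection}
gc()               # force a garbage collection and report memory usage
pryr::mem_used()   # total memory allocated to R objects

# Removing large objects and collecting garbage is the closest you get
# to "manually" releasing RAM from within R:
big <- rnorm(1e7)
rm(big)
gc()
```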
## Bla bla... Let's see some code!
Inspiration from [here](http://www.r-bloggers.com/bigglm-on-your-big-data-set-in-open-source-r-it-just-works-similar-as-in-sas/).
Download a fat data file:
```{r download_data}
# download.file("http://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/BSAPUFS/Downloads/2010_Carrier_PUF.zip", "2010_Carrier_PUF.zip")
# unzip(zipfile="2010_Carrier_PUF.zip")
```
The `data.table` package's `fread` is much more efficient than the `read.table`/`read.csv` family of functions.
You should also consider the `readr` [package](https://github.com/hadley/readr), which we did not document here (yet); see the sketch after the next chunk.
```{r import_data}
# install.packages('data.table')
library(data.table)
data <- data.table::fread(input = "2010_BSA_Carrier_PUF.csv",
                          sep = ',',
                          header = TRUE)
# For comparison, the base-R equivalent is much slower:
# read.csv("2010_BSA_Carrier_PUF.csv")
library(magrittr) # for piping syntax
.names <- c("sex", "age", "diagnose", "healthcare.procedure", "typeofservice", "service.count", "provider.type", "servicesprocessed", "place.served", "payment", "carrierline.count")
data %>% setnames(.names)
```
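As mentioned above, `readr` is an alternative importer we do not document in detail; a minimal sketch, assuming the same file and the column names defined above:
```{r readr_alternative}
# install.packages('readr')
library(readr)
data.readr <- read_csv("2010_BSA_Carrier_PUF.csv")
names(data.readr) <- .names  # reuse the column names defined above
```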
Now verify the size of your data in memory:
```{r}
object.size(data)
# But I prefer pryr:
pryr::object_size(data)
```
When does R create a copy of an object? Use `tracemem`:
```{r tracemem}
tracemem(data)
.test <- glm(payment ~ sex + age + place.served, data = data[1:1e2,], family=poisson)
```
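A small illustration of my own, using only base R, of how `tracemem` reports copies: plain assignment does not copy, but modifying the new binding triggers R's copy-on-modify, which `tracemem` prints.
```{r tracemem_toy}
x <- runif(1e6)
tracemem(x)     # start tracing copies of x

y <- x          # no copy yet: x and y point to the same memory
y[1] <- 0       # copy-on-modify: tracemem reports a duplication here

untracemem(x)   # stop tracing
```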
Profile each line of code for time and memory usage using [lineprof](https://github.com/hadley/lineprof)
```{r lineprof}
# devtools::install_github("hadley/lineprof")
prof <- lineprof::lineprof(
glm(payment ~ sex + age + place.served, data = data)
)
lineprof::shine(prof)
```
But actually, I just like to have my Task-Manager constantly open:
```{r inspect_RAM}
# Run and inspect RAM/CPU
glm(payment ~ sex + age + place.served, data = data, family=poisson)
```
Now let's artificially scale the problem.
Note: `copies` is small so that fitting can be done in real-time.
To demonstrate the problem, I would have set `copies <- 10`.
```{r artificial_scale}
copies <- 2
data.2 <- do.call(rbind, lapply(1:copies, function(x) data) )
system.time(data.2 %>% dim)
pryr::object_size(data)
pryr::object_size(data.2)
```
When you run the following code at home, it will *not* throw an out-of-memory error, but it will take a long time to run and a long time to release memory when stopped.
It is thus a *memory* constraint, not a CPU one.
```{r}
## Don't run:
## glm.2 <-glm(payment ~ sex + age + place.served, data = data.2, family=poisson)
```
Since this data still fits easily in RAM, the problem can be solved by a *streaming* algorithm.
The following object, however, cannot even be stored in RAM, so streaming *from RAM* will not solve the problem.
We will get back to this...
```{r}
## Don't run:
## copies <- 1e2
## data.3 <- do.call(rbind, lapply(1:copies, function(x) data) )
```
# Streaming Regression
We now explore several R implementations of streaming algorithms, which overcome RAM constraints at a moderate CPU cost.
## biglm
```{r biglm}
# install.packages('biglm')
library(biglm)
mymodel <- biglm::bigglm(payment ~ sex + age + place.served,
                         data = data.2,
                         family = poisson(),
                         maxit = 1e3)
# Too long! Quit the job and time the release.
# For demonstration: OLS example with original data.
mymodel <- bigglm(payment ~ sex + age + place.served, data = data)
mymodel <- data %>% bigglm(payment ~ sex + age + place.served, data = .)
```
Remarks:
- R is immediately(!) available after quitting the job.
- `bigglm` objects behave (almost) like `glm` objects w.r.t. `coef`, `summary`, ... (see the sketch after this list).
- `bigglm` is aimed at *memory* constraints, not speed.
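For instance, the usual extractors work on the fit from the previous chunk:
```{r bigglm_extractors}
coef(mymodel)      # estimated coefficients, as for a glm object
summary(mymodel)   # coefficient table with standard errors
```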
## Exploit sparsity in your data
Very relevant to factors with many levels.
```{r}
reps <- 1e6
y <- rnorm(reps)
x <- letters %>%
  sample(reps, replace = TRUE) %>%
  factor
X.1 <- model.matrix(~ x - 1)  # Make dummy variable matrix
library(MatrixModels)
X.2 <- as(x, "sparseMatrix") %>% t  # Makes sparse dummy matrix
dim(X.1)
dim(X.2)
pryr::object_size(X.1)
pryr::object_size(X.2)
```
```{r}
system.time(lm.1 <- lm(y ~ X.1))
system.time(lm.1 <- lm.fit(y=y, x=X.1))
system.time(lm.2 <- MatrixModels:::lm.fit.sparse(X.2,y))
all.equal(lm.2, unname(lm.1$coefficients), tolerance = 1e-12)
```
# Streaming classification
[LiblineaR](http://cran.r-project.org/web/packages/LiblineaR/index.html) and [RSofia](http://cran.r-project.org/web/packages/RSofia/index.html) will stream your data from RAM for classification problems, mainly SVMs (see the sketch below).
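A minimal sketch of fitting a linear SVM with `LiblineaR` on simulated data; the interface below is from my recollection of the package, so check `?LiblineaR` before relying on the exact arguments.
```{r liblinear_sketch}
# install.packages('LiblineaR')
library(LiblineaR)

n <- 1e5; p <- 10
X <- matrix(rnorm(n * p), n, p)
y <- sign(X[, 1] + rnorm(n))    # labels in {-1, 1}, driven mostly by the first feature

fit <- LiblineaR(data = X, target = y,
                 type = 2,      # an L2-regularized SVM formulation; see ?LiblineaR for all types
                 cost = 1)
pred <- predict(fit, newx = X)$predictions
mean(pred == y)                 # training accuracy
```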
# Out of memory Regression
What if it is not the __algorithm__ that causes the problem, but merely __importing__ my objects?
## ff
The `ff` package replaces R's in-RAM storage mechanism with (efficient) on-disk storage.
First open a connection to the file, without actually importing it.
```{r}
# install.packages('LaF')
library(LaF)
.dat <- laf_open_csv(filename = "2010_BSA_Carrier_PUF.csv",
                     column_types = c("integer", "integer", "categorical", "categorical", "categorical", "integer", "integer", "categorical", "integer", "integer", "integer"),
                     column_names = c("sex", "age", "diagnose", "healthcare.procedure", "typeofservice", "service.count", "provider.type", "servicesprocessed", "place.served", "payment", "carrierline.count"),
                     skip = 1)
```
Now write the data to HD as an ff object:
```{r}
# install.packages('ffbase')
library(ffbase)
data.ffdf <- laf_to_ffdf(laf = .dat)
```
Notice the minimal RAM allocation:
```{r}
pryr::object_size(data)
pryr::object_size(data.ffdf)
```
Caution: `base` functions are unaware of `ff` objects.
Adapted algorithms are required (see also the chunk-wise sketch below)...
```{r}
data$age %>% table
ffbase:::table.ff(data.ffdf$age)
```
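More generally, `ff` workflows process the data chunk by chunk. A minimal sketch of my own, assuming the `chunk()` generic that `ff` supplies for its objects: each chunk is pulled into RAM as an ordinary data.frame, summarized, and the partial results are combined.
```{r ff_chunkwise}
# Chunk-wise mean of the payment column: only one chunk is in RAM at a time.
partial.sum <- 0
partial.n <- 0
for (i in chunk(data.ffdf)) {
  d <- data.ffdf[i, ]           # this chunk, as an ordinary data.frame
  partial.sum <- partial.sum + sum(d$payment)
  partial.n <- partial.n + nrow(d)
}
partial.sum / partial.n         # overall mean of payment
```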
Luckily, `bigglm` has its `ff` version:
```{r biglm_regression}
mymodel.ffdf.2 <- bigglm.ffdf(payment ~ sex + age + place.served,
                              data = data.ffdf,
                              family = poisson(),
                              maxit = 1e3)
# Again, too slow. Stop and run:
mymodel.ffdf.2 <- bigglm.ffdf(payment ~ sex + age + place.served,
                              data = data.ffdf)
```
The previous approach scales to any file I can store on disk (though it might take a while).
I will now inflate the data to a size that would not fit in RAM.
```{r}
copies <- 2e1
data.2.ffdf <- do.call(rbind, lapply(1:copies, function(x) data.ffdf) )
# Actual size on disk: .rambytes gives bytes per element for each storage mode,
# and the constant 9.31322575e-10 converts bytes to gigabytes (1/2^30).
cat('Size in GB ', sum(.rambytes[vmode(data.2.ffdf)]) * (nrow(data.2.ffdf) * 9.31322575 * 10^(-10)))
# In memory:
pryr::object_size(data.2.ffdf)
```
And now I can run this MASSIVE regression:
```{r biglm_ffdf_regression}
## Do not run:
# mymodel.ffdf.2 <- bigglm.ffdf(payment ~ sex + age + place.served,
# data = data.2.ffdf,
# family = poisson(),
# maxit=1e3)
```
Notes:
- Notice again the quick release of memory when aborting the process.
- Solving RAM constraints does not guarantee speed. This particular problem is actually worth parallelizing.
- SAS, SPSS, Revolution R, ... all rely on similar ideas.
- Clearly, with so few variables I would be better off *subsampling* (see the sketch after this list).
- The [SOAR](http://cran.r-project.org/web/packages/SOAR/index.html) package also allows similar out-of-memory processing.
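A minimal subsampling sketch: fit the same model on a random subset of rows that comfortably fits in RAM, which is often statistically adequate when only a handful of coefficients are being estimated.
```{r subsampling_sketch}
set.seed(1)
n.sub <- 1e5                                  # a subsample size that easily fits in RAM
idx <- sample(nrow(data.2), n.sub)
glm.sub <- glm(payment ~ sex + age + place.served,
               data = data.2[idx, ],
               family = poisson)
coef(glm.sub)
```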
# Out of memory Classification
I do not know if there are `ff` versions of `LiblineaR` or `RSofia`.
If you find out, let me know.
# Parallelization
## Parallelized learning
[TODO]
## Parallelized simulation
[TODO]
## Distributed Graph algorithms
[TODO]