Working With Modestly Large Datasets in R

2 Nov

Even modestly large (< 1 GB) datasets can quickly overwhelm a modern personal computer. Working with such datasets in R can be still more frustrating because of how R uses memory. Here are a few tips on how to work with modestly large datasets in R.

Setting Memory Limits
On Windows, right-click the R shortcut and, in the Target field, set the maximum vector size and memory size as follows:

"path\to\Rgui.exe" --max-vsize=4800M (Deprecated as of 2.14). 

Alternately, use

utils::memory.limit(size = 4800) in .Rprofile (Windows-only).

Type mem.limits() to check the current maximum vector size.
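Both approaches above can be combined in a startup file; a minimal sketch (the 4800 MB figure is just an example, and memory.limit() works only on Windows):

```r
# In .Rprofile: raise the memory ceiling at startup (Windows-only)
if (.Platform$OS.type == "windows") {
  utils::memory.limit(size = 4800)  # request up to ~4.8 GB
}
mem.limits()  # report the current vector/node size limits (older R versions)
```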

Reading in CSVs
Either specify column classes manually, or infer them by reading in the first few rows – enough that each column's type can be inferred correctly – and reusing the classes R assigns.

# Read the first 10 rows to get the classes
ads5    <- read.csv("data.csv", header = T, nrows = 10)
classes <- sapply(ads5, class)

Specifying the number of rows in the dataset via nrows (even a modest overestimate of what is there) can also help, since R can allocate memory up front.

read.csv("data.csv", header = T, nrows = N, colClasses = classes)  

Improvements in performance are not always stupendous, but given the low cost of implementation, they are likely worthwhile.
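Putting the two steps together, here is a minimal sketch of a helper (the function name read_big_csv is my own, not a library function) that infers column classes from a small sample and then reads the full file:

```r
# Sketch: infer column classes from a sample, then read the full file
read_big_csv <- function(path, sample_rows = 10, total_rows = -1) {
  sample  <- read.csv(path, header = TRUE, nrows = sample_rows)
  classes <- sapply(sample, class)
  # nrows = -1 means "read everything"; pass an (over)estimate if known
  read.csv(path, header = TRUE, nrows = total_rows, colClasses = classes)
}
```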

Selective Reading
You can selectively read columns by setting colClasses to the string "NULL" (note the quotes; an unquoted NULL does not work) for the columns you don't want read.
Alternately, you can rely on cut. For instance,

data <- read.table(pipe("cut -f 2,5 -d, data.csv"), sep = ",", header = T)
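The colClasses route can be sketched as follows; here the file is assumed to have five columns, of which only the second and fifth are kept (the column count is illustrative):

```r
# "NULL" (the string) drops a column; NA lets R infer the class
keep <- c("NULL", NA, "NULL", "NULL", NA)
data <- read.csv("data.csv", header = TRUE, colClasses = keep)
```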

Opening Connections
Trying to read a large CSV directly can end in disaster. Opening a connection first reduces memory demands.

abc <- file("data.csv")
bbc <- read.csv(abc)


f <- file("data.csv")
Df <- sqldf("select * from f", dbname = tempfile(), file.format = list(header = T, row.names = F))
One problem with this route is that it cannot deal with quoted fields that contain commas.
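A connection also lets you process the file in chunks, so that only one block of rows is in memory at a time. A minimal sketch, assuming an arbitrary 10,000-row chunk size (process_chunk is a hypothetical stand-in for your own code):

```r
con <- file("data.csv", open = "r")
# the first chunk carries the header; later chunks reuse its column names
chunk <- read.csv(con, nrows = 10000, header = TRUE)
cols  <- names(chunk)
while (nrow(chunk) > 0) {
  # process_chunk(chunk)  # hypothetical: aggregate, filter, write out, ...
  chunk <- tryCatch(
    read.csv(con, nrows = 10000, header = FALSE, col.names = cols),
    error = function(e) chunk[0, ])  # read.csv errors at end of file
}
close(con)
```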

Using Filehash

The filehash package stores objects in a database on the hard drive rather than in RAM. You can access the data either using with(), if you load the database into an environment, or directly via dbLoad(), which mimics the functionality of attach(). Downside: it is tremendously slow.

dumpDF(read.csv("data.csv", header = T, nrows = N, colClasses = classes), dbName = "db01")
ads <- db2env(db = "db01")
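Once the environment is created, individual columns can be used via with() without loading the whole data frame into RAM; a sketch (price is a hypothetical column name):

```r
# requires library(filehash); ads comes from db2env(db = "db01") above
with(ads, mean(price))  # price is a hypothetical column in data.csv
```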

Selecting Columns

Use subset(data, select = columnList) rather than data[, columnList].
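For instance, assuming data has columns named a, b, and c (the names are hypothetical):

```r
columnList <- c("a", "b")
small <- subset(data, select = columnList)  # rather than data[, columnList]
```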