понеділок, 30 вересня 2013 р.

RHadoop

The short history how R meets Hadoop:

  1.  Hadoop + R via streaming
  2.  This one 

So, R through Streaming on Hadoop was discussed in a previous article. It's obvious that using streaming is not the best approach on Hadoop because some additional issues are appearing. To facilitate development with R on Hadoop RHadoop was created.

RHadoop is set of R packages aim to facilitate writing MapReduce code in R on Hadoop. It still uses streaming, but brings the following advantages:
  • don’t need to manage key change in Reducer
  • don’t need to control functions output manually
  • simple MapReduce API for R
  • enables access to files on HDFS 
  • R code can be run on local env/Hadoop without changes

Installation

To enable RHadoop on existing Hadoop cluster the following steps must be applied:
  1. install R on each node in Cluster
  2. on each node install RHadoop packages with dependencies
  3. set up env variables HADOOP_CMD and HADOOP_STREAMING; run R from console and check that these variables are accessible

What is RHadoop?

RHadoop is set of packages for R language, it contains the next packages currently:
  • rmr provides MapReduce interface; mapper and reducer can be described in R code and then called from R
  • rhdfs provides access to HDFS; using simple R functions, you can copy data between R memory, the local file system, and HDFS
  • rhbase required if you are going to access HBase 
You install and load this package the same as you would for any other R package.

  To start coding RHadoop, create regular R file, load required packages (if you need to access HDFS, call hdfs.init()). Then, you need to implement mapper function with interface 
function(key, line)
and reducer function (if you need)
function(key, val.list) 
The reduce function takes two arguments, one is a key and the other is a collection of all the values associated with that key. It could be one of vector, list, data frame or matrix depending on what was returned by the map function.
To emit value from mapper and reducer both you should use keyval function
keyval(key, val)

After that, you just need to call mapreduce function and MR jobs will be run in this moment (during execution, of course):

mapreduce(

  input,

  output = NULL,

  map = to.map(identity),

  reduce = NULL,
  combine = NULL,
  input.format = "native",
  output.format = "native"
)
In fact, this function has more parameters, I just highlight the most important. As you see, input and output formats is "native" by default which means sequence files instead of text. Also, csv and text can be used here, check documentation to get more information.
If you have implemented Mapper and Reducer, you can call MR code in the following manner:


mapreduce(input=INPUT,
          output=OUTPUT,
          input.format=make.input.format("csv", sep = ","),
          output.format="text",
          map=mapper,
          reduce=reducer)

Pay attention on input/output format, it is 'sequnce' hadoop format by default. That's why custom formatter has been used here. In the same maner, reading from HDFS into memory can be used:


d <- from.dfs("/user/hue/rhadoop.csv", format=make.input.format("csv", sep = ","))

Development notes

To facilitate development, RHadoop provides function 
rmr.options(backend = c("hadoop", "local"), ...
it has several arguments and the most important is backend execution mode which can be hadoop (default, runs execution on cluster) and local, runs execution on local environment. It can be useful for development/troubleshooting purposes. So, execution mode can be changed just by passing different argument into script without code changes. However, please mention that rhdfs package doesn't work at local mode at all.
And second pleasant surprise is RStudio server. It can be installed on any (one) node in Hadoop cluster and than used from anywhere through web UI (identical to desktop version what is amazing!)

Personal impressions

Wow!

Немає коментарів:

Дописати коментар