R is a language for statistics, mathematics and data science, created by statisticians for statisticians. It offers 5000+ implemented algorithms and an impressive community of 2M+ users with domain knowledge worldwide. However, it has one big disadvantage: all data must be held in memory, and it is processed in a single thread.
Then there is Hadoop, a powerful framework for distributed data processing. Hadoop is built on the idea of the MapReduce (MR) programming model. MapReduce itself is nothing very specific, and many languages have MR capabilities, but Hadoop brought it to a new level. The main idea of MR is:
- Map step: Map(k1,v1) → list(k2,v2)
- Magic here: the framework sorts the map output and groups the values by key (the shuffle)
- Reduce step: Reduce(k2, list(v2)) → list(v3)
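The three steps above can be sketched in plain R, entirely in memory and without Hadoop. This is only a toy word count; the function names map_fn, reduce_fn and run_mapreduce are made up for illustration, and split() stands in for the sort-and-group step that Hadoop performs between map and reduce.

```r
# Toy in-memory MapReduce: count words in a few lines of text.
# map_fn, reduce_fn and run_mapreduce are illustrative names only.

# Map(k1, v1) -> list(k2, v2): emit (word, 1) for every word in a line
map_fn <- function(line_no, line) {
  words <- strsplit(tolower(line), "[^a-z]+")[[1]]
  words <- words[nzchar(words)]
  lapply(words, function(w) list(key = w, value = 1))
}

# Reduce(k2, list(v2)) -> list(v3): sum the counts collected for one word
reduce_fn <- function(key, values) {
  sum(values)
}

run_mapreduce <- function(lines) {
  # Map step: apply map_fn to every (line number, line) pair and flatten
  pairs <- unlist(Map(map_fn, seq_along(lines), lines),
                  recursive = FALSE, use.names = FALSE)
  keys   <- vapply(pairs, function(p) p$key, character(1))
  values <- vapply(pairs, function(p) p$value, numeric(1))

  # "Magic": group all values that share a key (keys come out sorted)
  grouped <- split(values, keys)

  # Reduce step: one call per key
  vapply(names(grouped),
         function(k) reduce_fn(k, grouped[[k]]),
         numeric(1))
}

counts <- run_mapreduce(c("a b a", "b c"))
print(counts)
```

On a cluster the map and reduce calls run on different machines and the grouping happens over the network, but the contract between the steps is the same.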
Hadoop was developed in Java, and Java is its main programming language. Even so, you can write MR jobs in any other language, for example Python, R or OCaml. This mechanism is called the "Streaming API".
Of course, not every feature available in Java is available in R, because streaming works through Unix streams (STDIN and STDOUT); no surprise here. The Streaming API has several drawbacks:
- while the inputs to the reducer are grouped by key, they are still iterated over line-by-line, and the boundaries between keys must be detected by the user
- there is no way to use different mappers in one MapReduce job
- there is no way to produce several different outputs from a reducer
- counter updates are not transparent (streaming uses stderr to report counter updates)
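To make the last drawback concrete, here is a minimal sketch of a counter update from an R streaming script. It relies on the line format described in the Hadoop Streaming documentation, reporter:counter:<group>,<counter>,<amount>, written to stderr; the helper names counter_line and report_counter, and the group and counter names in the usage line, are hypothetical.

```r
# Build the line that Hadoop Streaming interprets as a counter update.
# counter_line and report_counter are illustrative helper names.
counter_line <- function(group, counter, amount = 1) {
  sprintf("reporter:counter:%s,%s,%d", group, counter, as.integer(amount))
}

# Counter updates travel over stderr; stdout is reserved for key/value output
report_counter <- function(group, counter, amount = 1) {
  cat(counter_line(group, counter, amount), "\n", sep = "", file = stderr())
}

# Example: count a malformed input record (names are made up)
report_counter("MyJob", "bad_records", 1)
```

Because the update is just a line on stderr, nothing checks it locally; a typo in the format only shows up as a missing counter in the job report.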
To use streaming, two separate files must be created: one represents the Mapper (map function) and the other the Reducer (reduce function).
As mentioned, the mapper reads lines from STDIN and emits lines to STDOUT, and the reducer works the same way. We need to detect key changes manually (Hadoop guarantees only that the input is sorted by key).
Let's consider the following mapper code; here we just extract the required information from the input and select the key:
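A minimal sketch of such a mapper, assuming hypothetical comma-separated input lines of the form id,period,value. The field layout and the names map_line and run_mapper are illustrative and not taken from the original job. The key is separated from the rest of the record by a tab, which is the default convention in Hadoop Streaming.

```r
#!/usr/bin/env Rscript
# mapper.R (sketch): assumes hypothetical input lines "id,period,value"

# Turn one input line into "key<TAB>period<TAB>value"; NULL if malformed
map_line <- function(line) {
  fields <- strsplit(line, ",", fixed = TRUE)[[1]]
  if (length(fields) < 3) return(NULL)
  key    <- trimws(fields[1])
  period <- suppressWarnings(as.numeric(fields[2]))
  value  <- suppressWarnings(as.numeric(fields[3]))
  if (!nzchar(key) || is.na(period) || is.na(value)) return(NULL)
  paste(key, period, value, sep = "\t")
}

# Read the input one line at a time so memory use stays constant
run_mapper <- function(con = file("stdin"), out = stdout()) {
  if (!isOpen(con)) {
    open(con, "r")
    on.exit(close(con))
  }
  while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
    record <- map_line(line)
    if (!is.null(record)) writeLines(record, out)
  }
}

# Run only when executed as a script, not when sourced for testing
if (sys.nframe() == 0L) run_mapper()
```

Keeping the per-line logic in map_line makes it easy to check in an ordinary R session before the script goes anywhere near a cluster.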
The reducer collects all values related to one key and then performs a regression to predict the value in a future period:
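A minimal sketch of such a reducer, assuming its input lines look like key<TAB>period<TAB>value and arrive sorted by key. The simple linear model value ~ period, the choice to predict exactly one period ahead, and the names predict_next, emit and run_reducer are assumptions made for illustration, not the original model.

```r
#!/usr/bin/env Rscript
# reducer.R (sketch): assumes "key<TAB>period<TAB>value" lines sorted by key

# Fit value ~ period for one key and predict the next period
predict_next <- function(periods, values) {
  next_period <- max(periods) + 1
  if (length(unique(periods)) < 2) {
    # Too few points to fit a line: fall back to the mean (an assumption)
    return(c(period = next_period, prediction = mean(values)))
  }
  fit  <- lm(value ~ period, data = data.frame(period = periods, value = values))
  pred <- predict(fit, newdata = data.frame(period = next_period))
  c(period = next_period, prediction = unname(pred))
}

# Write one output line: key, predicted period, predicted value
emit <- function(key, periods, values, out) {
  p <- predict_next(periods, values)
  writeLines(paste(key, p[["period"]], round(p[["prediction"]], 4), sep = "\t"), out)
}

run_reducer <- function(con = file("stdin"), out = stdout()) {
  if (!isOpen(con)) {
    open(con, "r")
    on.exit(close(con))
  }
  current_key <- NULL
  periods <- numeric(0)
  values  <- numeric(0)
  while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
    fields <- strsplit(line, "\t", fixed = TRUE)[[1]]
    if (length(fields) < 3) next
    key <- fields[1]
    # Key boundary: input is only sorted, so the change is detected here
    if (!is.null(current_key) && key != current_key) {
      emit(current_key, periods, values, out)
      periods <- numeric(0)
      values  <- numeric(0)
    }
    current_key <- key
    periods <- c(periods, as.numeric(fields[2]))
    values  <- c(values, as.numeric(fields[3]))
  }
  # Flush the last key, which has no following boundary
  if (!is.null(current_key)) emit(current_key, periods, values, out)
}

# Run only when executed as a script, not when sourced for testing
if (sys.nframe() == 0L) run_reducer()
```

Only the values of the current key are held in memory, so the reducer scales with the size of the largest group rather than with the whole input.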
This code can be tested in local mode:
    cat file.txt | mapper.R | sort -k 1,1 | reducer.R
or it can be run on Hadoop without changes.