Why is R so important on Hadoop?
The short answer is: for the same reasons that apply to Oracle R Enterprise.
The long answer is:
"BI tools provide fairly limited statistical functionality. It is therefore standard practice in many organizations to extract portions of a database into desktop software packages: statistical packages like SAS, Matlab or R, spreadsheets like Excel, or custom code written in languages like Java. There are various problems with this approach. First, copying out a large database extract is often much less efficient than pushing computation to the data; it is easy to get orders of magnitude performance gains by running code in the database. Second, most stat packages require their data to fit in RAM. For large datasets, this means sampling the database to form an extract, which loses detail."
(from "MAD Skills: New Analysis Practices for Big Data", http://db.cs.berkeley.edu/papers/vldb09-madskills.pdf)
Of course, you can develop in Java on Hadoop. But are your statisticians, data scientists, and analysts familiar with Java? Perhaps not; they are more likely familiar with R. RHadoop provides an excellent way to develop a working model locally and then deploy that model on a Hadoop cluster.
Admittedly, MapReduce implemented in R is not the best approach from a performance point of view. But consider how many people you need to build a high-performance solution in Java: a data scientist and a software engineer, plus more time to test and debug their solution. By contrast, a single data scientist can create and deploy a solution in R in less time. This "people performance" becomes more important, especially when the business needs a lot of one-off research.