Thursday, 31 October 2013

Basic visualisation in R


# plot two histograms together in R
p1 <- hist(rnorm(500, 4))                                       # centered at 4
p2 <- hist(rnorm(500, 4))                                       # centered at 4
plot(p1, col = rgb(0, 0, 1, 1/4), xlim = c(0, 10))              # first histogram
plot(p2, col = rgb(1, 1, 0, 1/2), xlim = c(0, 10), add = TRUE)  # second, drawn on top of the first


Thursday, 3 October 2013

Six Hadoop distributions to consider


Hortonworks is the most recent player: Yahoo! spun it off rather than keep maintaining its own Hadoop infrastructure in house. Everything they do is open source and stays very close to the Apache Hadoop project as it evolves. They have already moved to YARN and provide an easy-to-start virtual machine. Many comprehensive tutorials are available on their website, along with regular webinars.

Cloudera is perhaps the oldest and best-known provider; it turned Hadoop into a commercially viable product and is still the market leader. The Cloudera product is based on Apache Hadoop with many of its own patches and enhancements, which are released as open source. Cloudera has also created some entirely new products, such as Cloudera Search (Solr on Hadoop). In addition, some proprietary enterprise components are available at extra cost. They have already moved to YARN and also provide a virtual machine for evaluation.

Tuesday, 1 October 2013

Why is R so important on Hadoop?

A short history of how R meets Hadoop:
  1. Hadoop + R via streaming
  2. RHadoop
  3. this one


Why is R so important on Hadoop?

The short answer is: the reasons are the same as for Oracle R Enterprise.

The long answer is:
BI tools provide fairly limited statistical functionality. It
is therefore standard practice in many organizations to extract portions of a database into desktop software packages:
statistical package like SAS, Matlab or R, spreadsheets like
Excel, or custom code written in languages like Java.
There are various problems with this approach. First,
copying out a large database extract is often much less efficient
than pushing computation to the data; it is easy to get
orders of magnitude performance gains by running code in
the database. Second, most stat packages require their data
to fit in RAM. For large datasets, this means sampling the
database to form an extract, which loses detail.
(from "MAD Skills: New Analysis Practices for Big Data" http://db.cs.berkeley.edu/papers/vldb09-madskills.pdf)


Of course, you can develop in Java on Hadoop. But are your statisticians, data scientists, and analysts familiar with Java? Perhaps not; they are familiar with R. RHadoop provides an excellent way to develop a working model locally and then deploy it on a Hadoop cluster. Admittedly, MapReduce implemented in R is unlikely to be the best approach from a performance point of view. But think about how many people you need to create a high-performance solution in Java: one data scientist, one software engineer, and more time to test and debug their solution. By contrast, one data scientist can create and deploy a solution in less time. This "people performance" becomes more important, especially when the business requires a lot of one-off research.
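To give a feel for what "develop locally" can look like, here is a minimal sketch in base R only. It does not use RHadoop or any Hadoop API; it merely emulates the map, shuffle, and reduce steps in memory so the logic can be checked on a laptop. The names map_fn, reduce_fn, and run_local, as well as the sample records, are hypothetical and exist only for this illustration.

```r
# Map step: emit one (key, value) pair per input record.
map_fn <- function(record) {
  list(key = record$group, value = record$amount)
}

# Reduce step: combine all values that share the same key.
reduce_fn <- function(key, values) {
  data.frame(group = key, total = sum(values), n = length(values),
             stringsAsFactors = FALSE)
}

# Hypothetical local driver: applies the map step, groups values by key
# (the "shuffle"), then applies the reduce step to each group.
run_local <- function(records, map_fn, reduce_fn) {
  pairs   <- lapply(records, map_fn)
  keys    <- vapply(pairs, function(p) as.character(p$key), character(1))
  values  <- vapply(pairs, function(p) p$value, numeric(1))
  grouped <- split(values, keys)
  out     <- Map(reduce_fn, names(grouped), grouped)
  do.call(rbind, unname(out))
}

# Made-up sample data for the sketch.
records <- list(
  list(group = "a", amount = 10),
  list(group = "b", amount = 5),
  list(group = "a", amount = 2.5)
)

result <- run_local(records, map_fn, reduce_fn)
print(result)
```

The point of keeping the map and reduce steps as plain functions is that an analyst can test them on a small sample first; how they are then wired into a real cluster job depends on the RHadoop setup in use and is not shown here.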