вівторок, 10 грудня 2013 р.

R connection to Hive

Short instruction how to query Hive from R via JDBC.



First of all install rJava: sudo apt-get install r-cran-rjava
After that install RJDBC package with all dependencies: install.packages("RJDBC",dep=TRUE)
In next step Hadoop libraries for Hive conneections must be added to classpath. The easiest way to do it: copy all jars for pattern /usr/lib/hive/lib/*.jar and /usr/lib/hadoop/*.jar to your classpath on target machine (when RJDBC client is located).

Also, HiveServer must be started, for Cloudera distribution use
hive --service hiveserver2

instead of
sudo service hive-server2 start

(as was mentioned http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH4/4.2.1/CDH4-Installation-Guide/cdh4ig_topic_18_8.html)
Now it is time to check if HiveServer is running properly, follow the next command line steps:
/usr/lib/hive/bin/beeline
beeline> connect jdbc:hive2://localhost:10000 username password org.apache.hive.jdbc.HiveDriver

Connecting to jdbc:hive2://127.0.0.1:10000/default
Connected to: Hive (version 0.10.0)
Driver: Hive (version 0.10.0-cdh4.3.0)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Finaly, we can write R code to connect Hive and fetch some information
library(RJDBC)
# this is a regular JDBC connection
# jdbc:hive://192.168.0.104:10000/default
drv <- JDBC(driverClass = "org.apache.hive.jdbc.HiveDriver",
            classPath = list.files("/opt/jars/hive",pattern="jar$",full.names=T),
            identifier.quote="`")
conn <- dbConnect(drv, "jdbc:hive2://192.168.0.104:10000/default", "admin", "admin")

r <- dbGetQuery(conn, "select col_1, sum(col_2) from tab2 where id>? group by col_1", "10")
And the result is going to be like:
col_1          _c1
1 false 243808846,65
2  true       486,65