Short instruction how to query Hive from R via JDBC.
First of all install rJava: sudo apt-get install r-cran-rjava
After that install RJDBC package with all dependencies: install.packages("RJDBC",dep=TRUE)
In next step Hadoop libraries for Hive conneections must be added to classpath. The easiest way to do it: copy all jars for pattern /usr/lib/hive/lib/*.jar and /usr/lib/hadoop/*.jar to your classpath on target machine (when RJDBC client is located).
Also, HiveServer must be started, for Cloudera distribution use
hive --service hiveserver2
instead of
sudo service hive-server2 start
(as was mentioned http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH4/4.2.1/CDH4-Installation-Guide/cdh4ig_topic_18_8.html)
Now it is time to check if HiveServer is running properly, follow the next command line steps:
/usr/lib/hive/bin/beeline beeline> connect jdbc:hive2://localhost:10000 username password org.apache.hive.jdbc.HiveDriver Connecting to jdbc:hive2://127.0.0.1:10000/default Connected to: Hive (version 0.10.0) Driver: Hive (version 0.10.0-cdh4.3.0) Transaction isolation: TRANSACTION_REPEATABLE_READ
library(RJDBC) # this is a regular JDBC connection # jdbc:hive://192.168.0.104:10000/default drv <- JDBC(driverClass = "org.apache.hive.jdbc.HiveDriver", classPath = list.files("/opt/jars/hive",pattern="jar$",full.names=T), identifier.quote="`") conn <- dbConnect(drv, "jdbc:hive2://192.168.0.104:10000/default", "admin", "admin") r <- dbGetQuery(conn, "select col_1, sum(col_2) from tab2 where id>? group by col_1", "10")
col_1 _c1 1 false 243808846,65 2 true 486,65