Thursday, April 24, 2014

Hue Notifier for Hadoop goes wild

Several months ago I developed a Chrome browser plugin for my own needs. As a Hadoop engineer I faced the same problem every day: I run a lot of Hive/Pig jobs simultaneously, and they take a long time (from several minutes to several hours). So I had to check job completion by walking through Hue's pages in my browser. Well, it was 1) irritating and 2) distracting from coding...

As a solution I developed the Hue Notifier for Hadoop plugin for Google Chrome. It "monitors" the state of a job and informs you about completion, much like GMail informs you about new mail (a pop-up over all windows). I have quite limited knowledge of JavaScript, and this was the first time I had written a browser plugin... so I'm absolutely sure it can be improved. I tested it with the Hue shipped with Cloudera 4.3 and Cloudera 5, as well as with HDP 2.0. The most irritating issue with my code: Chrome Notifications must be enabled manually before you start using the plugin :(

The source code is available on GitHub under this repository. You are welcome to fork and improve it. Or, if you just wish to contribute, ping me and I will grant you access (and push the changes to Google Play afterwards).

Friday, April 18, 2014

Building a BigData ETL with Hive and Oozie

Perhaps Hive is the most successful component of today's Hadoop infrastructure. It provides a simple and efficient way of creating Hadoop-based data processing jobs in a comfortable SQL-like language. But, in contrast to Pig, it is not a workflow-friendly language and requires additional effort to create a real multi-step ETL.
Oozie was created to eliminate exactly these workflow/scheduling issues; it can obviously be used to build an ETL, and it engages Hive naturally.
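To make this concrete, a single Hive step in an Oozie workflow might look like the sketch below (the workflow name, script name, and parameters are illustrative, not from this post; `jobTracker`, `nameNode`, `inputDir`, and `outputDir` would come from your `job.properties`):

```xml
<workflow-app name="hive-etl" xmlns="uri:oozie:workflow:0.4">
    <start to="hive-step"/>

    <action name="hive-step">
        <hive xmlns="uri:oozie:hive-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <script>etl_step.hql</script>
            <param>INPUT=${inputDir}</param>
            <param>OUTPUT=${outputDir}</param>
        </hive>
        <ok to="end"/>
        <error to="fail"/>
    </action>

    <kill name="fail">
        <message>Hive step failed: [${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>
```

A multi-step ETL is then just a chain of such actions, each one's `<ok to="..."/>` pointing at the next step instead of `end`.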

Tuesday, April 1, 2014

Spark on HDP2

Here is my first experience with Apache Spark, running it on Hadoop. I ran into several issues while getting my piece of code to work.
To be honest, I started with the Cloudera CDH5 distribution: they promised that Spark was already included and that usage would be simple. But no luck; in fact, it doesn't work at all - even on a local machine with their spark-cloudera jar. I didn't want to waste my time, so I just downloaded the Spark distro onto HDP2.
First of all, let's start Spark in standalone mode, according to the documentation:
# start master
./sbin/start-master.sh

# pick up spark://IP:PORT from the log output
# and then run a worker on each node
./bin/spark-class org.apache.spark.deploy.worker.Worker spark://IP:PORT

# more documentation is available at https://spark.apache.org/docs/0.9.0/spark-standalone.html

After that I wrote a small amount of Scala code which, in fact, just counts hardcoded words in a document:

package experiment

import org.apache.spark.{SparkConf, SparkContext}

object SimpleApp {

  def main(args: Array[String]) {
    val logFile = args(0) // e.g. hdfs:///user/hue/input.txt
    val conf = new SparkConf()
      .setAppName("My Spark application")
      .set("spark.executor.memory", "1g")
    val sc = new SparkContext(conf)

    val logData = sc.textFile(logFile, 2).cache()
    val numAs = logData.filter(line => line.contains("London")).count()
    val numBs = logData.filter(line => line.contains("Lviv")).count()
    println("Lines with London: %s, Lines with Lviv: %s".format(numAs, numBs))

    sc.stop()
  }
}
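To actually launch this class against the standalone cluster started above, one approach is roughly the following (a sketch, assuming an sbt project and Spark 0.9 conventions; the jar path and the master URL are illustrative, and `SimpleApp` reads the master from the `spark.master` property since it is not hardcoded in `SparkConf`):

```shell
# package the application (the jar name depends on your sbt project settings)
sbt package

# run the driver with Spark's classpath, pointing it at the standalone master
SPARK_CLASSPATH=target/scala-2.10/experiment_2.10-1.0.jar \
SPARK_JAVA_OPTS="-Dspark.master=spark://IP:PORT" \
./bin/spark-class experiment.SimpleApp hdfs:///user/hue/input.txt
```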