Friday, October 10, 2014

Tuning the MapReduce job

java.lang.OutOfMemoryError: GC overhead limit exceeded
That's what I got yesterday while running my shiny new MapReduce job.

An OutOfMemoryError in Java has different causes: no more memory available, GC called too often (my case), no free PermGen space, etc.

To get more information about JVM internals, we have to tune how the JVM runs. I'm using the Hortonworks distribution, so I went to Ambari, opened the MapReduce configuration tab, and found mapreduce.reduce.java.opts. This property is responsible for the reducer's JVM configuration. Let's add garbage collector logging:
-verbose:gc -Xloggc:/tmp/@taskid@.gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
This writes the GC log to the local filesystem under /tmp; the file name is the task ID plus a .gc extension.

In general, the following properties are important for JVM tuning:

  • mapred.child.java.opts - Provides JVM options to pass to map and reduce tasks; usually includes the -Xmx option to set the maximum heap size, and may also include -Xms to set the initial heap size.
  • mapreduce.map.java.opts - Overrides mapred.child.java.opts for map tasks.
  • mapreduce.reduce.java.opts - Overrides mapred.child.java.opts for reduce tasks.
After entering the new value for a property, the MapReduce service must be restarted (Hortonworks reminds you with a yellow "Restart" button); only after the restart will the changes be applied. The next step is to run the MapReduce job; as a result, the per-task logs will be placed into the /tmp folder on each node. The same options can also be set per job, as sketched below.
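For a job submitted programmatically, the same property can be set on the job configuration instead of cluster-wide; a minimal sketch using the standard Hadoop API (the heap size and job name are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

Configuration conf = new Configuration();
// Reducer JVM: 2 GB heap plus the GC logging flags from above (values are illustrative).
conf.set("mapreduce.reduce.java.opts",
        "-Xmx2048m -verbose:gc -Xloggc:/tmp/@taskid@.gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps");
Job job = Job.getInstance(conf, "my-job");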

It's a bit difficult to read the log by hand, but luckily several UI tools exist on the market. I prefer the open-source GCViewer, which is a Java application and doesn't require installation. It supports a wide range of JVMs; moreover, it has a command-line interface for generating reports, so report generation can be automated.

Opening the GC log gives a detailed overview of the memory state:

Legend:

  • Green line that shows the length of all GCs
  • Magenta area that shows the size of the tenured generation (not available without PrintGCDetails)
  • Orange area that shows the size of the young generation (not available without PrintGCDetails)
  • Blue line that shows used heap size

Thursday, October 9, 2014

Unit test for Hive query

Sometimes the soul wants something really extraordinary... for example, to write a unit test for a Hive query :)

Let's see how it is possible, step by step. So, to write a unit test for Hive:

First of all, a local Hive instance must be run, and for that we need a local metastore (I propose Apache Derby) and directories for temporary data, logs, etc. As all configuration will be read from system properties, I didn't find a better way than to set all of them up programmatically...
Be sure to create all the mentioned directories before starting Hive, for example with Apache Commons IO:

FileUtils.forceMkdir(HIVE_BASE_DIR);
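The HIVE_*_DIR constants used throughout this post are plain java.io.File handles; a minimal sketch of how they might be declared (the base path is an assumption, any writable location works, and each directory then gets its own forceMkdir call):

import java.io.File;

static final File HIVE_BASE_DIR = new File("target/hive-test");
static final File HIVE_METADB_DIR = new File(HIVE_BASE_DIR, "metastore_db");
static final File HIVE_WAREHOUSE_DIR = new File(HIVE_BASE_DIR, "warehouse");
static final File HIVE_SCRATCH_DIR = new File(HIVE_BASE_DIR, "scratchdir");
static final File HIVE_LOCAL_SCRATCH_DIR = new File(HIVE_BASE_DIR, "localscratchdir");
static final File HIVE_LOGS_DIR = new File(HIVE_BASE_DIR, "logs");
static final File HIVE_TMP_DIR = new File(HIVE_BASE_DIR, "tmp");
static final File HIVE_HADOOP_TMP_DIR = new File(HIVE_BASE_DIR, "hadooptmp");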

And after that, register all of them as system properties:

        System.setProperty("javax.jdo.option.ConnectionURL", "jdbc:derby:;databaseName=" + HIVE_METADB_DIR.getAbsolutePath() + ";create=true");
        System.setProperty("hive.metastore.warehouse.dir", HIVE_WAREHOUSE_DIR.getAbsolutePath());
        System.setProperty("hive.exec.scratchdir", HIVE_SCRATCH_DIR.getAbsolutePath());
        System.setProperty("hive.exec.local.scratchdir", HIVE_LOCAL_SCRATCH_DIR.getAbsolutePath());
        System.setProperty("hive.metastore.metadb.dir", HIVE_METADB_DIR.getAbsolutePath());
        System.setProperty("test.log.dir", HIVE_LOGS_DIR.getAbsolutePath());
        System.setProperty("hive.querylog.location", HIVE_TMP_DIR.getAbsolutePath());
        System.setProperty("hadoop.tmp.dir", HIVE_HADOOP_TMP_DIR.getAbsolutePath());
        System.setProperty("derby.stream.error.file", HIVE_BASE_DIR.getAbsolutePath() + sep + "derby.log");

After that, the local Hive executor can be started:
HiveInterface client = new HiveServer.HiveServerHandler();
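Wired into a JUnit fixture, the whole setup might look like this; a minimal sketch, assuming JUnit 4 and the old embedded HiveServer1 API used above:

import org.apache.hadoop.hive.service.HiveInterface;
import org.apache.hadoop.hive.service.HiveServer;
import org.junit.Before;

private HiveInterface client;

@Before
public void setUp() throws Exception {
    // Create the directories and set the system properties shown above,
    // then start the embedded Hive handler.
    client = new HiveServer.HiveServerHandler();
}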

In fact, at this moment we are ready. Now I propose to create a Hive table, load data into it, and perform some queries. The best practice in the Java world is to keep all metadata/data for a test in separate files, so I put them under the resources directory in this example; here is the reading from resource text files:
client.execute(readResourceFile("/Example/table_ddl.hql"));
client.execute("LOAD DATA LOCAL INPATH '" +
                this.getClass().getResource("Example/data.csv").getPath() + "' OVERWRITE  INTO TABLE " + tableName);
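The readResourceFile helper above is the author's own; a plausible sketch, assuming it simply reads a classpath resource into a String (Commons IO assumed):

import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import org.apache.commons.io.IOUtils;

// Hypothetical helper: reads a classpath resource into a UTF-8 string.
private String readResourceFile(String path) throws Exception {
    try (InputStream in = getClass().getResourceAsStream(path)) {
        return IOUtils.toString(in, StandardCharsets.UTF_8.name());
    }
}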

OK, now the data is in the table and Hive knows about it. Let's perform a query:
client.execute("select sum(revenue), avg(revenue) from " + tableName + " group by state");

Even more, we can register a custom function and test it!
client.execute("ADD JAR " + jar.getAbsolutePath());
client.execute("CREATE TEMPORARY FUNCTION TempFun as 'org.my.example.MainFunClass'");

And after that, we can call the fresh function:
client.execute("select TempFun(revenue) from " + tableName);
String revenueProcessed = client.fetchOne();
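From here it is a regular JUnit assertion; a sketch (the expected value is hypothetical and depends entirely on your test data and UDF):

import static org.junit.Assert.assertEquals;

// The expected value is purely illustrative.
assertEquals("<expected value for your data>", revenueProcessed);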