четвер, 9 жовтня 2014 р.

Unit test for Hive query

Sometimes the soul wants something really extraordinaly... for example, to write a unit test for Hive query :)

Let's how it is possible step be step. So, to write unit test for Hive:

First of all, the local hive instance must be run, and for that we need local metastor (I propose Apache Derby) and directories for temporary data, logs, etc. As all configuration will be read from system properties, I didn't find beter way then set up all of them programaticaly...
Be shure to create all mentioned directories before starting Hive, for example with google Guava:


And after then register all of them in system environment:

        System.setProperty("javax.jdo.option.ConnectionURL", "jdbc:derby:;databaseName=" + HIVE_METADB_DIR.getAbsolutePath() + ";create=true");
        System.setProperty("hive.metastore.warehouse.dir", HIVE_WAREHOUSE_DIR.getAbsolutePath());
        System.setProperty("hive.exec.scratchdir", HIVE_SCRATCH_DIR.getAbsolutePath());
        System.setProperty("hive.exec.local.scratchdir", HIVE_LOCAL_SCRATCH_DIR.getAbsolutePath());
        System.setProperty("hive.metastore.metadb.dir", HIVE_METADB_DIR.getAbsolutePath());
        System.setProperty("test.log.dir", HIVE_LOGS_DIR.getAbsolutePath());
        System.setProperty("hive.querylog.location", HIVE_TMP_DIR.getAbsolutePath());
        System.setProperty("hadoop.tmp.dir", HIVE_HADOOP_TMP_DIR.getAbsolutePath());
        System.setProperty("derby.stream.error.file", HIVE_BASE_DIR.getAbsolutePath() + sep + "derby.log");

After that, the local hive executor might be started:
HiveInterface client = new HiveServer.HiveServerHandler();

In fact, we are ready in this moment. Now I propose to create a Hive table, load data into it and perform some queries. The best practice in Java world is to put all metadata/data for test in separate file, so I put them under resources directory in this example, and here is reading from resource text files:
client.execute("LOAD DATA LOCAL INPATH '" +
                this.getClass().getResource("Example/data.csv").getPath() + "' OVERWRITE  INTO TABLE " + tableName);

Ok, now data in the table and Hive knows about them. Let's perform a query:
client.execute("select sum(revenue), avg(revenue) from " + tableName + " group by state");

Even more, we can register custom function and test it!
client.execute("ADD JAR " + HIVE_BASE_DIR.getAbsolutePath() + jar.getAbsoluteFile());
client.execute("CREATE TEMPORARY FUNCTION TempFun as 'org.my.example.MainFunClass'");

And after that we can call fresh function:
client.execute("select TempFun(revenue) from " + tableName);
String revenueProcessed = client.fetchOne();

1 коментар:

  1. I recently needed to solve this problem too, but I did it a little differently. I dug thru the Hive source code to see how they are doing it. I put together an article and github repo with info on what I learned


    Its something to consider, and hopefully it will save someone some time...