How to present XML as a flat Hive table after an XSLT transformation
Let's start by defining the task. Imagine that the dataset is a set of XML files and the requirement is to present some specific information from these files as a simple flat structure. Let's illustrate:
Definitely, we could use an XML SerDe, but what if the XML structure is not known beforehand and we want to give the end user a chance to control the parsing process? One possible solution is to use XSLT to transform the XML into the desired format.
A bit later I will show how the transformation can be applied from a Hive query, but for now let's focus on the XSLT.
At a high level, the XSLT looks like:
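As a minimal sketch of the idea, assume a hypothetical `<book>` document with an `id` attribute and `title` and `author` children. A stylesheet that flattens it into one tab-separated row can be tried locally with the JDK's built-in XSLT engine before Hive gets involved. The element names, the class name `XsltFlattenSketch` and the tab-separated layout are illustrative assumptions, not the stylesheet from this post.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class XsltFlattenSketch {
    // Hypothetical stylesheet: method="text" suppresses XML markup in the output,
    // and <xsl:text>&#9;</xsl:text> emits a literal tab between the fields.
    static final String XSLT =
        "<xsl:stylesheet version=\"1.0\""
      + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
      + "<xsl:output method=\"text\"/>"
      + "<xsl:template match=\"/book\">"
      + "<xsl:value-of select=\"@id\"/><xsl:text>&#9;</xsl:text>"
      + "<xsl:value-of select=\"title\"/><xsl:text>&#9;</xsl:text>"
      + "<xsl:value-of select=\"author\"/>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Applies the stylesheet to one XML document and returns the flat row.
    static String flatten(String xml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSLT)));
            StringWriter row = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)),
                                  new StreamResult(row));
            return row.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("XSLT transformation failed", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical sample record, used only to exercise the stylesheet.
        String xml = "<book id=\"42\"><title>Hive</title><author>Doe</author></book>";
        System.out.println(flatten(xml));
    }
}
```

Emitting plain text with a fixed delimiter keeps the output easy to map onto table columns; the end user changes only the stylesheet to extract different fields.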
Let's store this XSLT in a file named transformation.xslt.
We are going to use the TRANSFORM functionality of Hive. Groovy offers a really straightforward way to run an XSLT transformation, so it can be used to run the transformation from Hive. This blog post, http://www.pleus.net/blog/?p=1448, contains a great overview of how to do that. After that, we can store the Groovy file as run-transformation.groovy. Don't forget to pass the path to the XSLT file as an argument.
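For readers who want to see the logic of such a script spelled out, here is a rough Java sketch of the same idea. It is not the Groovy script from the linked post. It assumes that Hive streams each row to the script's standard input as one line containing one complete XML document, and that the stylesheet produces a single line of text per document. The class name `RunTransformationSketch` and both method names are hypothetical.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.InputStreamReader;
import java.io.PrintStream;
import java.io.StringReader;
import java.io.StringWriter;
import java.nio.charset.StandardCharsets;
import javax.xml.transform.Source;
import javax.xml.transform.Templates;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class RunTransformationSketch {
    // Compile the stylesheet once; a Templates object can be reused for every row.
    static Templates compile(Source stylesheet) {
        try {
            return TransformerFactory.newInstance().newTemplates(stylesheet);
        } catch (TransformerException e) {
            throw new IllegalStateException("Cannot compile stylesheet", e);
        }
    }

    // Reads one XML document per line, writes one transformed row per line.
    // Assumption: the stylesheet output contains no line breaks of its own.
    static void run(Templates templates, BufferedReader in, PrintStream out) {
        try {
            String line;
            while ((line = in.readLine()) != null) {
                if (line.trim().isEmpty()) {
                    continue; // skip blank rows
                }
                StringWriter row = new StringWriter();
                templates.newTransformer().transform(
                    new StreamSource(new StringReader(line)),
                    new StreamResult(row));
                out.println(row);
            }
        } catch (Exception e) {
            throw new IllegalStateException("Transformation failed", e);
        }
    }

    public static void main(String[] args) {
        // args[0] is the path to the XSLT file, e.g. transformation.xslt
        Templates templates = compile(new StreamSource(new File(args[0])));
        BufferedReader in = new BufferedReader(
            new InputStreamReader(System.in, StandardCharsets.UTF_8));
        run(templates, in, System.out);
    }
}
```

Locally, the sketch can be exercised by piping a file with one XML document per line into it and passing the stylesheet path as its only argument. Compiling the stylesheet once, rather than per row, matters when the script processes many rows in a single task.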
And the last step is to prepare an HQL file that contains the Hive script and runs the transformation on the cluster in distributed mode: