Tuesday, September 3, 2013

Several Apache Pig tips and tricks

1. Counters in Pig

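PigStatusReporter reporter = PigStatusReporter.getInstance();
if (reporter != null) {
    reporter.getCounter("MyGroup", "MyCounter").increment(1); // group and counter names are placeholders
}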

2. Pick up the latest version of a JAR
An incredibly simple way to always use the latest version of a JAR file without code changes:

-- sort -V (GNU sort) orders the paths by version number; tail -1 keeps the highest one
%default elephantBirdJar `hadoop fs -ls /tmp/libs/elephant-bird-core*jar | awk '{print $8;}' | sort -V | tail -1`
register 'hdfs://$elephantBirdJar'

3. Call Java code from a Pig script without a UDF
It's a pity that you have to write a UDF every time a Java call is needed, even for a single line of code. There is a way to call existing Java methods without writing a UDF: Dynamic Invokers. They work for static methods whose arguments are strings or primitive numbers. For example, the java.net.URLDecoder#decode method is called in the next example:

DEFINE UrlDecode InvokeForString('java.net.URLDecoder.decode', 'String String');
encoded_strings = LOAD 'data.txt' as (encoded:chararray);
decoded_strings = FOREACH encoded_strings GENERATE UrlDecode(encoded, 'UTF-8');
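
To get a feeling for what happens behind such a DEFINE, here is a minimal plain-Java sketch that resolves a static method from its fully qualified name via reflection and calls it. It is only an illustration and not Pig's actual code; the class InvokerSketch and the helper invokeForString are hypothetical names, and the helper handles only the "String String" signature used above.

```java
import java.lang.reflect.Method;

class InvokerSketch {
    // Hypothetical helper: looks up a static method that takes two Strings,
    // given its fully qualified name, and returns the result as a String.
    static String invokeForString(String qualifiedName, String arg1, String arg2) {
        int dot = qualifiedName.lastIndexOf('.');
        try {
            Class<?> owner = Class.forName(qualifiedName.substring(0, dot));
            Method method = owner.getMethod(qualifiedName.substring(dot + 1), String.class, String.class);
            // The receiver is null because the target method is static.
            return (String) method.invoke(null, arg1, arg2);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot invoke " + qualifiedName, e);
        }
    }

    public static void main(String[] args) {
        String decoded = invokeForString("java.net.URLDecoder.decode", "a%20b%26c", "UTF-8");
        System.out.println(decoded); // prints "a b&c"
    }
}
```

Since the method is looked up by name at runtime, a typo in the method name or in the argument list shows up only when the script runs, so it is worth testing the DEFINE on a small input first.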

4. Set a timeout for a long-running UDF
Sometimes a UDF takes much longer than expected. Pig can stop a long-running UDF automatically if the class carries the @MonitoredUDF annotation (more information is available in the Pig UDF documentation).

/* The timeout for the UDF is 10 seconds; if there is no result by then, the default value is returned.
   Pay close attention: only a few return types are supported, i.e. tuples and bags are not. */

import java.util.concurrent.TimeUnit;
import org.apache.pig.EvalFunc;
import org.apache.pig.builtin.MonitoredUDF;

@MonitoredUDF(timeUnit = TimeUnit.MILLISECONDS, duration = 10000, intDefault = 10)
public class MyUDF extends EvalFunc<Integer> {
    /* implementation goes here */
}
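
The "return a default on timeout" behaviour can be tried out in plain Java, without a cluster. The sketch below is only an analogy built on java.util.concurrent and makes no claim about how Pig implements the annotation; TimeoutSketch and callWithTimeout are hypothetical names.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class TimeoutSketch {
    // Hypothetical helper: runs the task on a worker thread and returns
    // defaultValue if no result arrives within the given duration.
    static int callWithTimeout(Callable<Integer> task, long duration, TimeUnit unit, int defaultValue) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<Integer> future = executor.submit(task);
        try {
            return future.get(duration, unit);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the slow task
            return defaultValue;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            return defaultValue;
        } catch (ExecutionException e) {
            throw new IllegalStateException("Task failed", e.getCause());
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // Finishes well within the limit, so its own result is returned.
        int fast = callWithTimeout(() -> 42, 1000, TimeUnit.MILLISECONDS, 10);
        // Sleeps past the limit, so the default is returned instead.
        int slow = callWithTimeout(() -> { Thread.sleep(2000); return 42; }, 100, TimeUnit.MILLISECONDS, 10);
        System.out.println(fast + " " + slow); // prints "42 10"
    }
}
```

The default value plays the same role as intDefault in the annotation: the caller cannot tell a timed-out call from one that really returned 10, so pick a default that cannot be confused with a valid result.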
