Kostiantyn's blog: Pig's evolution

Improvements in Apache Pig 0.12

Assert operator

An assert operator can be used for data validation. For example, the following script will fail if any a0 value is not greater than 0:

a = load 'something' as (a0:int, a1:int);
assert a by a0 > 0, 'a0 must be greater than 0';

Streaming UDF

Users can now write UDFs in languages that have no JVM implementation. So far, support has been implemented for CPython.
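As a rough illustration, a CPython UDF might look like the sketch below. It assumes the pig_util helper module that Pig supplies to streaming Python UDFs. The file name my_udfs.py, the function text_length, and the fallback decorator (there only so the file can be imported outside Pig) are hypothetical.

```python
# my_udfs.py (hypothetical file name)
try:
    # Supplied by Pig when the script is registered with streaming_python.
    from pig_util import outputSchema
except ImportError:
    # Assumption: a stand-in so the module can be imported and tested outside Pig.
    def outputSchema(schema):
        def decorate(func):
            func.output_schema = schema
            return func
        return decorate


@outputSchema('length:int')
def text_length(text):
    """Return the number of characters in text, or None for a null input."""
    if text is None:
        return None
    return len(text)
```

In a Pig script, such a file would typically be registered with something like register 'my_udfs.py' using streaming_python as my_udfs; and then called as my_udfs.text_length(name).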

Rewrite of AvroStorage

We completely revamped the AvroStorage. It is now part of Pig built-in functions. It uses the latest version of Avro and is significantly faster.

IN operator

Previously, Pig had no IN operator. To mimic one, users had to chain several OR conditions. Now you can write:

a = LOAD '1.txt' USING PigStorage(',') AS (i:int);

b = FILTER a BY i IN (1,22,333,4444,55555);

CASE expression

Here’s an example of the type of CASE expression that Pig now supports:

bar = FOREACH foo GENERATE ( 
  CASE i % 3 
     WHEN 0 THEN '3n' 
     WHEN 1 THEN '3n+1' 
     ELSE '3n+2' 
  END 
);

BigInteger/BigDecimal data types

Some applications require calculations with a high degree of precision. In these cases BigInteger and BigDecimal can be used for more precise calculations.
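To see why this matters, here is a small sketch in Python, using its decimal module as a stand-in for BigDecimal (an analogy, not Pig code). Binary floating point accumulates rounding error, while arbitrary-precision decimals keep the exact value.

```python
from decimal import Decimal

# Binary floating point cannot represent 0.1 exactly, so errors creep in.
float_total = sum([0.1] * 10)

# Arbitrary-precision decimals keep the exact value.
decimal_total = sum([Decimal('0.1')] * 10)

print(float_total == 1.0)               # False
print(decimal_total == Decimal('1.0'))  # True
```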

Support for Microsoft Windows™

Changes that enable Apache Pig to run on Windows without Cygwin have now been committed to the trunk.

Parquet Support

Pig now wraps ParquetLoader/ParquetStorer in built-in functions for Parquet format.

Improvements in Apache Pig 0.11

CUBE and ROLLUP Operators

The new CUBE and ROLLUP operators, modeled on the equivalent SQL operators, make it easy to compute aggregates over multi-dimensional data. Here is an example:

events = LOAD '/logs/events' USING EventLoader() AS (lang, country, app_id, event_id, total);
eventcube = CUBE events BY
  CUBE(lang, country), ROLLUP(app_id, event_id);
result = FOREACH eventcube GENERATE
  FLATTEN(group) as (lang, country, app_id, event_id),
  COUNT_STAR(cube), SUM(cube.total);
STORE result INTO 'cuberesult';

The CUBE operator produces all combinations of the cubed dimensions (2^n groupings for n dimensions). The ROLLUP operator produces all levels of a hierarchical group (n+1 groupings for n dimensions).
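The difference between the two is easier to see by listing the groupings. The Python sketch below is only a model of the idea, not Pig's implementation. The function names are made up for illustration, and None marks a dimension that has been aggregated away (Pig reports it as null).

```python
from itertools import combinations


def cube_groupings(dims):
    """All 2**n subsets of the dimensions, as CUBE would group them."""
    result = []
    for size in range(len(dims), -1, -1):
        for kept in combinations(dims, size):
            result.append(tuple(d if d in kept else None for d in dims))
    return result


def rollup_groupings(dims):
    """The n+1 hierarchical prefixes of the dimensions, as ROLLUP would group them."""
    return [tuple(dims[:k]) + (None,) * (len(dims) - k)
            for k in range(len(dims), -1, -1)]


print(cube_groupings(['lang', 'country']))
# [('lang', 'country'), ('lang', None), (None, 'country'), (None, None)]
print(rollup_groupings(['app_id', 'event_id']))
# [('app_id', 'event_id'), ('app_id', None), (None, None)]
```

Assuming the two clauses combine as a cross product, the example script above would aggregate over 4 x 3 = 12 grouping combinations.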

Groovy UDFs

Support for UDFs in Groovy has been added, providing an easy bridge for converting between Groovy and Pig data types and for specifying output schemas via annotations.

DateTime Data Type

The DateTime data type has been added to make it easier to work with timestamps. You can now do date and time arithmetic directly in a Pig script, using UDFs such as CurrentTime, AddDuration, WeeksBetween, etc. PigStorage expects timestamps to be represented in the ISO 8601 format.

RANK Operator

The new RANK operator assigns an ordinal number to every tuple in a relation. You can also rank by a field value, in which case the relation is sorted by that field before ranks are assigned. Tuples with the same sort value get the same rank. By default the next rank then skips ahead by the number of ties; with DENSE, the ranks stay consecutive with no gaps. Unfortunately, it uses only one reducer!
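The two tie-handling modes can be modeled in a few lines of Python. This is a sketch of the usual standard and dense ranking definitions, which is how I understand RANK ... BY with and without DENSE to behave. It is not Pig's actual implementation, and the function name is hypothetical.

```python
def rank(values, dense=False):
    """Rank values in ascending order; equal values share a rank."""
    ordered = sorted(values)
    ranks = []
    current = 0
    for position, value in enumerate(ordered, start=1):
        # Move to a new rank only when the value changes.
        if position == 1 or value != ordered[position - 2]:
            current = current + 1 if dense else position
        ranks.append((current, value))
    return ranks


scores = [10, 20, 20, 30]
print(rank(scores))              # [(1, 10), (2, 20), (2, 20), (4, 30)]
print(rank(scores, dense=True))  # [(1, 10), (2, 20), (2, 20), (3, 30)]
```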

@OutputSchema annotation

Instead of implementing an outputSchema function, UDF authors can tell Pig their output schema by annotating the UDF:

import org.apache.pig.builtin.OutputSchema

class GroovyUDF {
  // The annotation takes the schema as a string.
  @OutputSchema('x:long')
  long square(long x) {
    return x * x
  }
}

PrimitiveEvalFunc class

Extending PrimitiveEvalFunc allows the UDF author to skip all the tuple unwrapping business and simply implement public OUT exec(IN input), where IN and OUT are primitives.

mock.Storage

mock.Storage, a helper StoreFunc to simplify writing JUnit tests for your Pig scripts, was quietly added in 0.10.1 and got a couple of bug fixes in 0.11. See details in the mock.Storage docs.

Sharing code with Guava

FunctionWrapperEvalFunc allows one to easily wrap Guava Function implementations that contain the core logic, keeping UDF-specific code minimal.

UDF profiling

Setting the pig.udf.profile property to true will turn on counters that approximately measure the number of invocations and milliseconds spent in all UDFs and Loaders. Use this with caution, as this feature can really bloat the number of counters your job uses! Useful for lightweight debugging of jobs.

New Features in Apache Pig 0.10

Boolean Data Type

Pig 0.10 introduces the boolean data type as a first-class Pig data type.

a = load 'input' as (a0:boolean, a1:tuple(a10:boolean, a11:int), a2);

When loading boolean data using PigStorage, Pig expects the text “true” (case-insensitive) for a true value and “false” (case-insensitive) for a false value; any other value maps to null. When storing boolean data using PigStorage, a true value is written as the text “true” and a false value as the text “false”.
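The loading and storing rules above can be sketched in Python as follows. This only models the behavior described here, it is not PigStorage's code, and both function names are hypothetical.

```python
def parse_pig_boolean(text):
    """Map a text field to True, False, or None (null), ignoring case."""
    if text is None:
        return None
    lowered = text.lower()
    if lowered == 'true':
        return True
    if lowered == 'false':
        return False
    return None  # anything else loads as null


def format_pig_boolean(value):
    """Text written for a boolean value; a null becomes an empty field here (assumption)."""
    if value is None:
        return ''
    return 'true' if value else 'false'


print(parse_pig_boolean('TRUE'))   # True
print(parse_pig_boolean('yes'))    # None
print(format_pig_boolean(False))   # false
```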

Nested Cross/Foreach

In Pig 0.10 you can use cross and foreach statements inside a nested foreach plan. Here is one example:

C = cogroup user by uid, session by uid;
D = foreach C {
    crossed = cross user, session;
    filtered = filter crossed by user::region == session::region;
    result = foreach filtered generate processSession(user::age, user::gender, session::ip); -- processSession is a UDF
    generate result;
}

Note that the maximum nesting level is 2.

JRuby UDF

You can now write UDFs in JRuby.

Default Split Destination

Split will automatically identify inputs that don’t belong to any of the other branches and direct those inputs to the “otherwise” destination:

split a into b if id > 3, c if id < 3, d otherwise;

Globbing in Register

Pig now supports globbing in register statements:

register lib/*.jar


Improvements to PigStorage

PigStorage can now store a .pig_schema file alongside the data when you pass the '-schema' option:

store a into 'output_dir' using PigStorage('\t', '-schema');

When loading data with PigStorage, Pig checks for a .pig_schema file and uses it automatically.

With the '-tagsource' option, PigStorage adds a new column, INPUT_FILE_NAME, holding the name of the file each record came from:

a = load 'input_dir' using PigStorage('\t', '-tagsource');

The first column of the output will be INPUT_FILE_NAME.



Kill Hadoop Job

If you kill a Pig job using Ctrl-C or “kill”, Pig will now kill all associated Hadoop jobs currently running. This is applicable to both grunt mode and non-interactive mode.


New Features in Apache Pig 0.9.0

Macros


With the macro feature introduced in 0.9, you can package a reusable piece of Pig Latin. For example, this macro counts the rows of a relation:

DEFINE row_count(X) RETURNS Z {
  Y = group $X all;
  $Z = foreach Y generate COUNT($X);
};

Pig embedding script


A common complaint about Pig is the lack of control flow statements: if/else, while loop, for loop, etc.

Pig now has an answer: Pig embedding. You can write a Python program and embed Pig scripts inside it, leveraging all the language features Python provides, including control flow.

The Pig embedding API is similar to a database embedding API. You compile a statement, bind its parameters, execute it, and then iterate through the results as you would with a cursor. The Pig embedding document provides an excellent guide on how the Pig embedding API works.
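The compile, bind, execute, iterate cycle might look like the sketch below. It assumes the org.apache.pig.scripting.Pig class that Pig exposes to embedded (Jython) scripts. The script text, paths, threshold logic, and function names are all hypothetical. The Pig import sits inside the function because that module only exists when the file is run through Pig itself.

```python
# Hypothetical Pig Latin template; $in, $out and $threshold are bound at run time.
SCRIPT = """
raw = LOAD '$in' USING PigStorage(',') AS (i:int);
kept = FILTER raw BY i > $threshold;
STORE kept INTO '$out';
"""


def build_params(input_path, output_dir, threshold):
    """Build the parameter map for bind(); keys match the $names in SCRIPT."""
    return {'in': input_path, 'out': output_dir, 'threshold': str(threshold)}


def run_rounds(input_path, max_rounds=3):
    """Raise the threshold each round and stop once no rows survive the filter."""
    # Only importable inside Pig's Jython runtime.
    from org.apache.pig.scripting import Pig

    compiled = Pig.compile(SCRIPT)                    # 1. compile the statement
    for round_no in range(max_rounds):                # plain Python control flow
        params = build_params(input_path, 'out_round_%d' % round_no, round_no * 10)
        bound = compiled.bind(params)                 # 2. bind the parameters
        stats = bound.runSingle()                     # 3. execute
        if not stats.isSuccessful():
            raise RuntimeError('Pig job failed in round %d' % round_no)
        rows = stats.result('kept').iterator()        # 4. iterate like a cursor
        if not rows.hasNext():
            break
    return round_no
```

A file like this would be launched through the pig command rather than a plain Python interpreter, with a call such as run_rounds('1.txt') added at the bottom.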



Project-range expression

Suppose the schema has columns in the following order: (user, age, gender, ip, start_date, rank, activity_summary, friend_list, privacy_setting).

In Pig 0.9, a query that keeps the columns from user through gender, replaces ip with a city and state, and keeps everything from start_date onward can be written using project-range expressions as:

input_city_state = FOREACH input GENERATE user .. gender, flatten(getCityState(ip)), start_date .. ;
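How a project-range expression resolves against the schema can be modeled in Python. This is an illustrative sketch, not Pig's implementation, and project_range is a made-up helper. Both ends of a range are inclusive, and an omitted end means "up to the last column".

```python
SCHEMA = ['user', 'age', 'gender', 'ip', 'start_date', 'rank',
          'activity_summary', 'friend_list', 'privacy_setting']


def project_range(schema, start=None, end=None):
    """Return the columns from start to end inclusive; an omitted bound is open."""
    first = 0 if start is None else schema.index(start)
    last = len(schema) - 1 if end is None else schema.index(end)
    return schema[first:last + 1]


print(project_range(SCHEMA, 'user', 'gender'))
# ['user', 'age', 'gender']
print(project_range(SCHEMA, start='start_date'))
# ['start_date', 'rank', 'activity_summary', 'friend_list', 'privacy_setting']
```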

Thursday, October 24, 2013