Improvements in Apache Pig 0.12
Assert operator
An assert operator can be used for data validation. For example, the following script will fail if a0 contains any non-positive value:
a = load 'something' as (a0:int, a1:int);
assert a by a0 > 0, 'a0 cannot be negative';
Streaming UDF
Users can now write a UDF in a language that has no JVM implementation. Currently, CPython is supported.
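As a sketch of how this might look (the file name scriptudf.py and the function name are illustrative), a CPython script is registered with the streaming_python engine and its functions are then called like any other UDF:

```pig
-- register a CPython file as a streaming UDF namespace (file name is hypothetical)
REGISTER 'scriptudf.py' USING streaming_python AS myfuncs;
a = LOAD 'input' AS (name:chararray);
-- call a function defined (with an output-schema decorator) in scriptudf.py
b = FOREACH a GENERATE myfuncs.helloworld(name);
```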
Rewrite of AvroStorage
We completely revamped AvroStorage. It is now one of Pig's built-in functions, uses the latest version of Avro, and is significantly faster.
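Because AvroStorage is now built in, no extra jar registration is needed; a minimal sketch (file paths are illustrative):

```pig
a = LOAD 'input.avro' USING AvroStorage();
STORE a INTO 'output' USING AvroStorage();
```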
IN operator
Previously, Pig had no IN operator. To mimic one, users had to chain several OR operators, but now:
a = LOAD '1.txt' USING PigStorage(',') AS (i:int);
b = FILTER a BY i IN (1,22,333,4444,55555);
CASE expression
Here’s an example of the type of CASE expression that Pig now supports:
bar = FOREACH foo GENERATE ( CASE i % 3 WHEN 0 THEN '3n' WHEN 1 THEN '3n+1' ELSE '3n+2' END );
BigInteger/BigDecimal data types
Some applications require calculations with a high degree of precision. In these cases BigInteger and BigDecimal can be used for more precise calculations.
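As a sketch (the input path and field names are illustrative), fields can be declared with the new biginteger and bigdecimal types and used in arithmetic like any other numeric type:

```pig
a = LOAD 'ledger' AS (id:chararray, balance:bigdecimal, serial:biginteger);
b = FOREACH a GENERATE id, balance * 2, serial + 1;
```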
Support for Microsoft Windows™
Changes that enable Apache Pig to run on Windows without Cygwin have now been committed to the trunk.
Parquet Support
Pig now wraps ParquetLoader/ParquetStorer in built-in functions for Parquet format.
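A minimal sketch (paths are illustrative) of reading and writing Parquet with the new built-ins:

```pig
a = LOAD 'input.parquet' USING ParquetLoader();
STORE a INTO 'output' USING ParquetStorer();
```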
Improvements in Apache Pig 0.11
CUBE and ROLLUP operators
The new CUBE and ROLLUP operators, equivalents of the corresponding SQL operators, provide the ability to easily compute aggregates over multi-dimensional data. Here is an example:
events = LOAD '/logs/events' USING EventLoader() AS (lang, country, app_id, event_id, total);
eventcube = CUBE events BY CUBE(lang, country), ROLLUP(app_id, event_id);
result = FOREACH eventcube GENERATE FLATTEN(group) AS (lang, country, app_id, event_id), COUNT_STAR(cube), SUM(cube.total);
STORE result INTO 'cuberesult';
The CUBE operator produces all combinations of the cubed dimensions. The ROLLUP operator produces all levels of a hierarchical group.
Groovy UDFs
Support for UDFs in Groovy is added, providing an easy bridge for converting Groovy and Pig data types and specifying output schemas via annotations.
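A sketch of how a Groovy UDF file might be registered and invoked (the file and function names are illustrative):

```pig
register 'myudfs.groovy' using groovy as myfuncs;
a = load 'numbers' as (i:long);
b = foreach a generate myfuncs.square(i);
```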
DateTime data type
The DateTime data type has been added to make it easier to work with timestamps. You can now do date and time arithmetic directly in a Pig script and use UDFs such as CurrentTime, AddDuration, WeeksBetween, etc. PigStorage expects timestamps to be represented in the ISO 8601 format.
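A sketch combining these UDFs (the input path and field names are illustrative); AddDuration takes an ISO 8601 duration, such as 'P1D' for one day:

```pig
a = LOAD 'events' AS (name:chararray, ts:datetime);
b = FOREACH a GENERATE name,
        AddDuration(ts, 'P1D') AS next_day,
        WeeksBetween(CurrentTime(), ts) AS weeks_ago;
```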
RANK operator
The new RANK operator allows one to assign an ordinal number to every tuple in a relation. A user can specify whether she wants exact rank (elements with the same sort value get the same rank) or DENSE rank (elements with the same sort value get consecutive rank values). One can also rank by a field value, in which case the relation is sorted by that field before ranks are assigned. Unfortunately, RANK currently runs on a single reducer.
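A sketch of both variants (the relation and field names are illustrative):

```pig
a = LOAD 'scores' AS (name:chararray, score:int);
ranked = RANK a BY score DESC;        -- ties share a rank; subsequent ranks skip
dense  = RANK a BY score DESC DENSE;  -- ties share a rank; ranks stay consecutive
```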
@OutputSchema annotation
Instead of implementing a getOutputSchema function, UDF authors can tell Pig their output schema by annotating the UDF:
class GroovyUDF {
@OutputSchema('x: long')
long square(long x){
return x*x
}
}
PrimitiveEvalFunc class
Extending PrimitiveEvalFunc allows the UDF author to skip all the tuple-unwrapping business and simply implement public OUT exec(IN input), where IN and OUT are primitive types.
mock.Storage
mock.Storage, a helper StoreFunc that simplifies writing JUnit tests for your Pig scripts, was quietly added in 0.10.1 and got a couple of bug fixes in 0.11. See details in the mock.Storage docs.
Sharing code with Guava
FunctionWrapperEvalFunc allows one to easily wrap Guava Functions as Pig UDFs.
UDF profiling
Setting the pig.udf.profile property to true will turn on counters that approximately measure the number of invocations and milliseconds spent in all UDFs and Loaders. Use this with caution, as this feature can really bloat the number of counters your job uses! It is useful for lightweight debugging of jobs.
Boolean Data Type
Pig 0.10 introduces boolean as a first-class Pig data type.
a = load 'input' as (a0:boolean, a1:tuple(a10:boolean, a11:int), a2);
When loading boolean data using PigStorage, Pig expects the text "true" (case-insensitive) for a true value and "false" (case-insensitive) for a false value; any other value maps to null. When storing boolean data using PigStorage, a true value emits the text "true" and a false value emits the text "false".
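Since boolean is now a first-class type, a boolean field can be used directly as a filter condition; a sketch (the input path and field names are illustrative):

```pig
a = load 'input' as (a0:boolean, a1:int);
b = filter a by a0;
```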
Nested Cross/Foreach
You can use nested cross and nested foreach statements inside a foreach nested plan in Pig 0.10. Here is an example:
C = cogroup user by uid, session by uid;
D = foreach C {
    crossed = cross user, session;
    filtered = filter crossed by user::region == session::region;
    result = foreach filtered generate processSession(user::age, user::gender, session::ip); -- processSession is a UDF
    generate result;
}
Note that the maximum nesting depth of a nested plan is 2.
JRuby UDF
You can now write UDFs in JRuby.
Default Split Destination
Split will automatically identify inputs that don’t belong to any of the other branches and direct those inputs to the “otherwise” destination:
split a into b if id > 3, c if id < 5, d otherwise;
Globbing in Register
Pig now supports globbing in register statements:
register lib/*.jar
Storing schemas with data
PigStorage can store a .pig_schema file alongside the data. When loading data with PigStorage, Pig will check for the existence of .pig_schema and use it automatically:
store a into 'output_dir' using PigStorage('\t', '-schema');
Tagging records with the source file
PigStorage can now add a new column, INPUT_FILE_NAME, which indicates the input file each record came from:
a = load 'input_dir' using PigStorage('\t', '-tagsource');
The first column of the output will be INPUT_FILE_NAME.
Kill Hadoop Job
If you kill a Pig job using Ctrl-C or "kill", Pig will now kill all associated Hadoop jobs currently running. This applies to both grunt mode and non-interactive mode.
New Features in Apache Pig 0.9.0
Macros
With the 0.9 macro feature, you can write a macro to count the rows of a relation:
DEFINE row_count(X) RETURNS Z {
    Y = group $X all;
    $Z = foreach Y generate COUNT($X);
};
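Once defined, a macro such as row_count is invoked like an operator; a sketch (the input path and relation names are illustrative):

```pig
input_data = LOAD 'users' AS (uid:int, name:chararray);
total = row_count(input_data);
```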
Pig embedding script
A common complaint about Pig is the lack of control flow statements: if/else, while loops, for loops, etc. Pig now has an answer: Pig embedding. You can write a Python program and embed Pig scripts inside it, leveraging all the language features provided by Python, including control flow. The Pig embedding API is similar to a database embedding API: you compile a statement, bind parameters, execute the statement, and then iterate through a cursor. The Pig embedding document provides an excellent guide to how the API works.
Project-range expression
Suppose the schema has columns in the following order: (user, age, gender, ip, start_date, rank, activity_summary, friend_list, privacy_setting). In Pig 0.9, a query can be written using a project-range expression as:
input_city_state = FOREACH input GENERATE user .. gender, flatten(getCityState(ip)), start_date .. ;
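The compile/bind/run cycle of the embedding API described above can be sketched as follows in Jython (the row-count script, paths, and names are illustrative; this runs under `pig script.py`, not standalone CPython):

```python
#!/usr/bin/python
from org.apache.pig.scripting import Pig

# compile a parameterized Pig script (paths are hypothetical)
P = Pig.compile("""
    a = LOAD '$in' AS (line:chararray);
    b = GROUP a ALL;
    c = FOREACH b GENERATE COUNT(a);
    STORE c INTO '$out';
""")

# bind parameters and execute the statement
result = P.bind({'in': 'input.txt', 'out': 'output'}).runSingle()
if result.isSuccessful():
    print 'row count job succeeded'
```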