понеділок, 30 вересня 2013 р.


The short history how R meets Hadoop:

  1.  Hadoop + R via streaming
  2.  This one 

So, R through Streaming on Hadoop was discussed in a previous article. It's obvious that using streaming is not the best approach on Hadoop because some additional issues are appearing. To facilitate development with R on Hadoop RHadoop was created.

RHadoop is set of R packages aim to facilitate writing MapReduce code in R on Hadoop. It still uses streaming, but brings the following advantages:
  • don’t need to manage key change in Reducer
  • don’t need to control functions output manually
  • simple MapReduce API for R
  • enables access to files on HDFS 
  • R code can be run on local env/Hadoop without changes

субота, 28 вересня 2013 р.

Hadoop + R via streaming

The short history how R meets Hadoop

R is language for Stats, Math and Data Science created by statisticians for statisticians. It contains 5000+ implemented algorithms and impressive 2M+ users with domain knowledge worldwide. However, it has one big disadvantage - all data is placed into memory ... in one thread.

And there is Hadoop. New, powerful framework for distributed data processing. Hadoop is built upon idea of MapReduce algorithm, this isn't something very specific, a lot of languages have MR capabilities, but Hadoop brought it to the new level.The main idea of MR is:

  1. Map step: Map(k1,v1) → list(k2,v2) 
  2. Magic here 
  3. Reduce step: Reduce(k2, list (v2)) → list(v3)

Hadoop was developed in Java and Java is the main programming languages for Hadoop. Although Java is main language, you can still use any other language to write MR: for example, Python, R or OCaml. It is called "Streaming API"

четвер, 5 вересня 2013 р.

Useful unix commands

1. Let's assume you have very long CSV file and you wish to find all non-empty columns #25, so:
tail -n +1 data.csv | cut -d ',' -f 25 | grep -v "^[[:space:]]*$"

2. In the previous file you need to find number of all rows with value 'CA' in the column 25:
cat -n data.csv | cut -d ',' -f 25,1 | grep -e "CA.*"

In output, you will get column value as well as row number
3. In the huge file you wish to get row #250007 in separate file:
sed -n '250007p' data.csv | cat > line_250007.csv

4. To get number of tabulation in line
tr -cd \t < line_250007.csv | wc -c

5. Ensure file copy is complete before starting process or ensure that file is not used by other process right now:
if [ $(lsof $file | wc -l) -gt 1 ] ; then
    echo "file $file still loading, skipping it"
    echo "file $file completed upload, process it"

6. Color script output for better visualisation
NORMAL=$(tput sgr0)
GREEN=$(tput setaf 2; tput bold)
RED=$(tput setaf 1)

function red() {
    echo -e "$RED$*$NORMAL"

function green() {
    echo -e "$GREEN$*$NORMAL"

red "This is error message"

7. Find all files in directory by regexp:
for FILE in $(ls $DIR | grep ^report_.*csv$)

8. Check previous command result
mkdir -p $WRK_DIR
if [ $? -ne 0 ]; then
  #do something

9. Extract 12 hours from NOW and cast to specific format:
date -d "+12 hours" +\%Y\%m\%d\%H

вівторок, 3 вересня 2013 р.

Several Apache Pig trips and ticks

1. Counters in Pig

PigStatusReporter reporter = PigStatusReporter.getInstance();
if (reporter != null) {

2. Pick up latest version of JAR
Incredible simple way to use always the last version of jar file without code changes

%default elephantBirdJar `hadoop fs -ls /tmp/libs/elephant-bird-core*jar | awk '{print $8;}' | sort -n | head -1`
register 'hdfs://$elephantBirdJar'

3. Call Java code without UDF from Pig script
It's pity that you have to write UDF each time when java call is required, even for the one line of code. There is a way to call built-in java functions without writing UDF and it's called Dynamic Invokers. For example, java.net.URLDecoder#decode method is called in the next example:

DEFINE UrlDecode InvokeForString('java.net.URLDecoder.decode', 'String String');
encoded_strings = LOAD 'data.txt' as (encoded:chararray);
decoded_strings = FOREACH encoded_strings GENERATE UrlDecode(encoded, 'UTF-8');

4. Set timeout for long-running UDF
Sometimes UDF can require much more time than it is expected, there is a way to stop long-running UDF automatically by Pig with @MonitoredUDF annotation (more information available here)

/* Timeout for UDF is 10 seconds, if no result. thna default will be returned; 
pay close attention, only several types are suported to be returned, i. e. there are not tuples or bags */

@MonitoredUDF(timeUnit = TimeUnit.MILLISECONDS, duration = 10000, intDefault = 10)
 public class MyUDF extends EvalFunc<Integer> {
   /* implementation goes here */