понеділок, 18 серпня 2014 р.

Writing in ElasticSearch directly from Hadoop MapReduce

ElasticSearch is a hot topic today. This is powerful open source search and analytics engine that makes data easy to explore. Several times I faced with data populating into ElasticSearch after Hadoop jobs completion.  A couple years it was non trivial issue that requires using binary ElasticSearch client and publishing data manually. Hopefully, there is already support by EalsticSearch for Hadoop today.

Let's see how it might be done with a simplest case: we have to put JSON formatted data into ElasticSearch for further analysis. So, our purpose is to write Map-only job that will populate ElasticSearch with data from text file (already in JSON).

First of all, let configure Configuration object:

        conf.setBoolean("mapred.map.tasks.speculative.execution", false);
        conf.setBoolean("mapred.reduce.tasks.speculative.execution", false);
        conf.set("es.resource", "emailIndex/email"); // intex/type
        conf.set("es.nodes", ""); // host
        conf.set("es.port", "11000"); // port
        conf.set("es.input.json", "yes");

I guess, everything is clear here.

Very important is to set up correct output format, pay attention on register:

        // Set input and output format classes

        // Specify the type of output keys and values

 After that we will implement Mapper (it emits only value, without key - this behavior is required by ES output format class!):

public static class EmailToEsMapper extends org.apache.hadoop.mapreduce.Mapper<LongWritable, Text, NullWritable, Text> {
        private Text output = new Text();

        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            String email = value.toString();

            context.write(NullWritable.get(), output);          


Let's back to the second code snippet. There is EsOutputFormat, pay attention on register, because there is old deprecated API with ESOutputFormat class.It might be required to add exclusion to Maven file, to pull correct versions of jars and omit dependencies hell:

yarn cascading</groupId> cascading-hadoop cascading cascading-local </exclusion> org.apache.pig pig org.apache.hive hive-service

середа, 13 серпня 2014 р.

Geo Coordinates converting

I've made discovery working on the last task: could you imagine that there are many many many geographical coordinate systems in the world? I couldn't. I was pretty sure that there is only one: longitude and latitude.

Surprise! There are much more of them and they are widely popular. Some of them are used in particular domain, some of them are specific for some countries. For example, you can read more about Gauss–Krüger coordinate system.

import org.geotools.geometry.GeneralDirectPosition;
import org.geotools.referencing.CRS;
import org.opengis.geometry.DirectPosition;
import org.opengis.referencing.FactoryException;
import org.opengis.referencing.NoSuchAuthorityCodeException;
import org.opengis.referencing.crs.CoordinateReferenceSystem;
import org.opengis.referencing.operation.MathTransform;
import org.opengis.referencing.operation.TransformException;

public strictfp double[] translate(String from, String to, double x, double y)
            throws FactoryException, NoSuchAuthorityCodeException, TransformException {

        CoordinateReferenceSystem sourceCRS = CRS.decode( from );
        CoordinateReferenceSystem targetCRS = CRS.decode( to );

        MathTransform transform = CRS.findMathTransform(sourceCRS, targetCRS, true);

        DirectPosition expPt = new GeneralDirectPosition(x, y);
        expPt = transform.transform(expPt, null);
        return expPt.getCoordinate();

Ok, it looks good. One time consuming issue - it's to include correct libraries with Maven, because this small piece of code has very wide dependencies and it took several hours to manage correct combination :)

So, maven dependencies: