четвер, 24 липня 2014 р.

Hadoop 2.2 Distributed Cache and Map Join

It's very common to use Distributed Cache for Map joins - it gives a possibility to implement extremely fast join of huge dataset with a small one(s). Comparing to other join techniques you can win up to 1000x speed up, so Map joins are extremely useful and widely used. It's the easiest way to implement outer join, non-equie join and so on, I'd recommend to use Map join always when it is possible.

What is bad about Hadoop and I don't like it - they change API very often, each new version has changes in API. The most weird example: interface Mapper. It was introduces, then deprecated and then dedepricated (in Hadoop 2 it's without @Deprecated)... oh, quite difficult to manage all changes...

The last changes:  DistributedCache is now deprecated. And you can't use the old good DistributedCache.addCacheFile

In the new Hadoop 2.x the new approach introduced:
1) add file to distributed cache (I'm using symlink here):
job.addCacheFile(new URI(conf.get("dimension.file")+"#YOUR_DIM"));

2) in your setup method (Mapper or Reducer) the data from cache might be read with following instruction:
Path[] files = context.getLocalCacheFiles(); // oh, this method is again deprecated ym_-)

// loop over all files in cache
for (Path p : files) {
    if (p.getName().equals("YOUR_DIM")) {
         // load cache (for example into Map)
    }
}

That's all, symlink are very useful for accessing file from cache.

Немає коментарів:

Дописати коментар