Now it is time for the composite join: a map-side join for huge datasets. Both datasets must meet several requirements in this case:
- The datasets are all sorted by the join key
- Each dataset has the same number of files (you can achieve that by setting the number of reducers)
- File N of each dataset contains the same join keys K (i.e. both datasets are partitioned identically)
- The files are not splittable
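To see why these requirements matter: with both inputs sorted and partitioned identically, each mapper can stream through the two matching files in lockstep, the way a merge join does. A minimal plain-Java sketch of that idea (the class and method names are mine, not Hadoop API; unique keys per side are assumed for simplicity):

```java
import java.util.*;

public class MergeJoinSketch {
    // Inner merge join over two lists sorted by key.
    // Each entry is {key, value}; returns joined rows {key, leftVal, rightVal}.
    static List<String[]> mergeJoin(List<String[]> left, List<String[]> right) {
        List<String[]> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < left.size() && j < right.size()) {
            int cmp = left.get(i)[0].compareTo(right.get(j)[0]);
            if (cmp < 0) {
                i++;                       // left key is smaller: advance left
            } else if (cmp > 0) {
                j++;                       // right key is smaller: advance right
            } else {                       // keys match: emit the joined row
                out.add(new String[]{left.get(i)[0], left.get(i)[1], right.get(j)[1]});
                i++;
                j++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> a = Arrays.asList(
            new String[]{"1", "alice"}, new String[]{"2", "bob"}, new String[]{"4", "dan"});
        List<String[]> b = Arrays.asList(
            new String[]{"2", "sales"}, new String[]{"3", "hr"}, new String[]{"4", "it"});
        for (String[] row : mergeJoin(a, b)) {
            System.out.println(String.join(",", row));
        }
    }
}
```

Each side is read exactly once and nothing is buffered in memory, which is what makes this viable when both datasets are huge.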
In this case you can perform a map-side join, joining file N of dataset A against file N of dataset B. The Hadoop API provides CompositeInputFormat for this purpose. Example of usage:
// in the job configuration you have to set
job.setInputFormatClass(CompositeInputFormat.class);
// inner - inner join (you can specify outer as well)
// d1, d2 - Paths to both datasets
job.getConfiguration().set(CompositeInputFormat.JOIN_EXPR,
    CompositeInputFormat.compose("inner", KeyValueTextInputFormat.class, d1, d2));
job.setNumReduceTasks(0);
The mapper will receive key-value pairs of type Text and TupleWritable:
@Override public void map(Text key, TupleWritable value, Context ctx) { ... }
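A fuller sketch of such a mapper might look like this (the value types and output format are assumptions; TupleWritable holds one entry per joined dataset, indexed in the order the paths were passed to compose()):

```java
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.join.TupleWritable;

// Hypothetical mapper: emits the joined record from both datasets.
public class CompositeJoinMapper
        extends Mapper<Text, TupleWritable, Text, Text> {

    @Override
    public void map(Text key, TupleWritable value, Context ctx)
            throws IOException, InterruptedException {
        Text left = (Text) value.get(0);   // record from dataset d1
        Text right = (Text) value.get(1);  // record from dataset d2
        ctx.write(key, new Text(left + "\t" + right));
    }
}
```

Since setNumReduceTasks(0) was set, whatever the mapper writes goes straight to the output files.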
Hive supports the same technique via the sort-merge bucket map join, enabled with these settings:
hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
hive.optimize.bucketmapjoin=true;
hive.optimize.bucketmapjoin.sortedmerge=true;
Of course, this requires the data in both tables to be sorted by the join key and bucketed into the same number of buckets.
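For illustration, a pair of tables that satisfies these requirements might be declared like this (the table and column names are hypothetical):

```sql
-- Both tables bucketed and sorted by the join key, with the same bucket count.
CREATE TABLE users (id INT, name STRING)
CLUSTERED BY (id) SORTED BY (id) INTO 8 BUCKETS;

CREATE TABLE orders (id INT, amount DOUBLE)
CLUSTERED BY (id) SORTED BY (id) INTO 8 BUCKETS;

-- With the settings above, this join can run as a sort-merge bucket map join:
SELECT u.name, o.amount
FROM users u JOIN orders o ON u.id = o.id;
```

Bucket N of users then lines up with bucket N of orders, exactly like file N in the CompositeInputFormat case.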