According to Wikipedia, "A Bloom filter is a space-efficient probabilistic data structure, conceived by Burton Howard Bloom in 1970, that is used to test whether an element is a member of a set. False positive matches are possible, but false negatives are not, thus a Bloom filter has a 100% recall rate. In other words, a query returns either 'possibly in set' or 'definitely not in set'."
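To make the definition concrete, here is a minimal sketch of the data structure in plain Java. It is not the Hadoop class used later in this post: the name TinyBloomFilter, the FNV-1a hash, and the double-hashing trick are all my own illustrative assumptions.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

/** Minimal Bloom filter sketch: a bit vector plus several derived hash functions. */
class TinyBloomFilter {
    private final BitSet bits;
    private final int vectorSize;
    private final int numHashes;

    TinyBloomFilter(int vectorSize, int numHashes) {
        this.bits = new BitSet(vectorSize);
        this.vectorSize = vectorSize;
        this.numHashes = numHashes;
    }

    /** FNV-1a over the bytes, salted with a seed (an illustrative choice of hash). */
    private static int hash(byte[] data, int seed) {
        int h = 0x811C9DC5 ^ seed;
        for (byte b : data) {
            h ^= (b & 0xFF);
            h *= 0x01000193;
        }
        return h;
    }

    /** Bit position for the i-th hash function, derived by double hashing: h1 + i * h2. */
    private int position(byte[] data, int i) {
        int h1 = hash(data, 0);
        int h2 = hash(data, 0x9E3779B9);
        return Math.floorMod(h1 + i * h2, vectorSize);
    }

    /** Sets every bit position the element hashes to. */
    void add(String element) {
        byte[] data = element.getBytes(StandardCharsets.UTF_8);
        for (int i = 0; i < numHashes; i++) {
            bits.set(position(data, i));
        }
    }

    /** false means "definitely not in set"; true means only "possibly in set". */
    boolean mightContain(String element) {
        byte[] data = element.getBytes(StandardCharsets.UTF_8);
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(position(data, i))) {
                return false;
            }
        }
        return true;
    }
}
```

Every added element sets all of its bit positions, so mightContain can never return false for something that was added. That is where the "no false negatives" guarantee comes from. A false positive happens when other elements have already set all the bits that a missing element maps to.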
Also, I found this site, which gives a very good description of Bloom filters with an excellent visualization; please check it out.
As is clear from the definition, this data structure really helps when we need to filter records, particularly when performing a join: we can turn the small dataset into a filter and then apply that filter in the map stage of a second MR job, which performs the actual join. In other words, we have two MR jobs: the first creates the filter, and the second filters on the map side and joins on the reduce side. Because false positives are possible, a few non-matching records may still reach the reducers, where the exact join discards them.
OK, the first MapReduce job contains two stages, a mapper and a reducer, because the result must be exactly one Bloom filter object. The mapper works as follows:
- initialize a BloomFilter object as a Mapper class member (the arguments are the vector size, the number of hash functions, and the hash type): BloomFilter filter = new BloomFilter(10000, 10, Hash.MURMUR_HASH);
- on each record, add it to the filter: filter.add(new Key(str.getBytes()));
- emit data only in the cleanup method; for example, you can just write the filter to a file without using the context at all
Your filter is now ready; it can be deserialized anywhere and used to filter data.
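The whole flow (build the filter from the small dataset, serialize it once, deserialize it elsewhere, drop records before the join) can be sketched in plain Java without a cluster. This is only a stand-in for the two MR jobs: the class BloomJoinSketch, its methods, the hash, and the byte format are hypothetical and are not the Hadoop API. In a real job the Hadoop BloomFilter's own serialization would presumably take the place of the hand-rolled one.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;
import java.util.Map;

class BloomJoinSketch {
    static final int VECTOR_SIZE = 10000;
    static final int NUM_HASHES = 10;

    /** i-th bit position for a key (illustrative double hashing, not MurmurHash). */
    static int position(String key, int i) {
        int h1 = key.hashCode();
        int h2 = Integer.rotateLeft(h1 * 0x9E3779B1, 15) | 1;
        return Math.floorMod(h1 + i * h2, VECTOR_SIZE);
    }

    /** "Job 1": add every small-side key, then emit the filter exactly once. */
    static byte[] buildFilter(List<String> smallSideKeys) {
        BitSet bits = new BitSet(VECTOR_SIZE);
        for (String key : smallSideKeys) {               // plays the role of map()
            for (int i = 0; i < NUM_HASHES; i++) {
                bits.set(position(key, i));
            }
        }
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {  // plays the role of cleanup()
            byte[] raw = bits.toByteArray();
            out.writeInt(raw.length);
            out.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    /** Deserializes the filter, e.g. in the setup of the second job's mapper. */
    static BitSet loadFilter(byte[] serialized) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(serialized))) {
            byte[] raw = new byte[in.readInt()];
            in.readFully(raw);
            return BitSet.valueOf(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static boolean mightContain(BitSet bits, String key) {
        for (int i = 0; i < NUM_HASHES; i++) {
            if (!bits.get(position(key, i))) {
                return false;
            }
        }
        return true;
    }

    /** "Job 2": filter large-side records first, then do the exact join. */
    static List<String> join(Map<String, String> smallSide, List<String[]> largeSide, BitSet filter) {
        List<String> joined = new ArrayList<>();
        for (String[] record : largeSide) {
            String key = record[0];
            if (!mightContain(filter, key)) {
                continue;                                 // map side: definitely no match
            }
            String match = smallSide.get(key);            // reduce side: exact lookup
            if (match != null) {                          // drops Bloom false positives
                joined.add(key + "\t" + match + "\t" + record[1]);
            }
        }
        return joined;
    }
}
```

The point of the filter is that most non-matching records of the large dataset are thrown away before the shuffle. The exact lookup at the end is still needed, because the filter alone can let a false positive through.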