Tuesday, November 11, 2014

Spark and Locality Sensitive Hashing, part 2

This is the second part of the series on Locality Sensitive Hashing; here we build a working example using Apache Spark.

Let's start with the task definition: there are two datasets, bank accounts and web-site visitors. The only field they have in common is the name, and it may contain misspellings. Let's consider the following example:

Bank Accounts

  Name          Credit score
  Tom Soyer     10
  Andy Bin      20
  Tom Wiscor    30
  Tomas Soyér   40

Web-site Visitors

  Name          email
  Tom Soyer     1@1
  Andrew Bin    2@1
  Tom Viscor    3@1
  Thomas Soyer  2@2
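
As a starting point, here is a minimal sketch of how these two datasets could be represented in Spark. The case class names, field names, and the in-memory sample data are assumptions for illustration; in a real job the records would be loaded from external storage such as HDFS or a database.

import org.apache.spark.{SparkConf, SparkContext}

// Illustrative record types; the field names are assumptions, not a fixed schema
case class BankAccount(name: String, creditScore: Int)
case class Visitor(name: String, email: String)

object LshExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("lsh-example").setMaster("local[*]"))

    // The sample records from the tables above, parallelized into RDDs
    val accounts = sc.parallelize(Seq(
      BankAccount("Tom Soyer", 10), BankAccount("Andy Bin", 20),
      BankAccount("Tom Wiscor", 30), BankAccount("Tomas Soyér", 40)))

    val visitors = sc.parallelize(Seq(
      Visitor("Tom Soyer", "1@1"), Visitor("Andrew Bin", "2@1"),
      Visitor("Tom Viscor", "3@1"), Visitor("Thomas Soyer", "2@2")))

    println(s"${accounts.count()} accounts, ${visitors.count()} visitors")
    sc.stop()
  }
}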

Friday, November 7, 2014

Spark and Locality Sensitive Hashing, part 1

Locality Sensitive Hashing is an algorithm designed to reduce the number of comparisons needed when searching for similar items in large datasets.



Let's consider the following example: assume we have two independent systems; one is a web application that gets a user's profile from a social network, and the second is an online payment system. Our idea is to merge profiles from the social network and the payment system. Of course, a social network user might not be present in the payment system at all, the accounts may have been created at different times, and we definitely don't have a foreign key to match them exactly. There are two main issues:

  • there are two huge datasets that must be merged
  • a user's name might look different in the social network and the payment system

The naive approach is to compare every social network user name with every payment system user name, calculate the Hamming distance between them, and pick the most similar pair as a successful match. The biggest issue here is the O(n²) complexity of this approach.
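
For illustration, here is a naive sketch of that approach, assuming a SparkContext named sc is available. The hammingDistance helper is a simplified variant that treats positions beyond the shorter string as mismatches, since the classic Hamming distance is only defined for strings of equal length.

import org.apache.spark.SparkContext._ // pair-RDD operations (needed in older Spark versions)

// Simplified Hamming distance: positions beyond the shorter string count as mismatches
def hammingDistance(a: String, b: String): Int =
  a.zipAll(b, '\u0000', '\u0000').count { case (x, y) => x != y }

// Toy inputs standing in for the full datasets
val names1 = sc.parallelize(Seq("Tom Soyer", "Andrew Bin", "Tom Viscor", "Thomas Soyer")) // visitors
val names2 = sc.parallelize(Seq("Tom Soyer", "Andy Bin", "Tom Wiscor", "Tomas Soyér"))    // accounts

// Naive matching: every visitor is compared with every account, O(n^2) pairs in total
val bestMatches = names1.cartesian(names2)
  .map { case (v, a) => (v, (a, hammingDistance(v, a))) }
  .reduceByKey((p, q) => if (p._2 <= q._2) p else q) // keep the closest account per visitor

bestMatches.collect().foreach(println)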

We want to minimize the number of comparisons between the two datasets. Fortunately, this problem is addressed by the Locality Sensitive Hashing algorithm. Let's consider simple hashing:
f(str) → x
we calculate the hash function f on a string (the user name from a profile) and get an integer x; then we need to compare Hamming distances only for strings that have the same x. The challenge is to pick a very good hash function, which is almost impossible. Fortunately, we are not limited to one function: we can apply several, tens, or even hundreds of hash functions. In this case we get data duplication, because one string is assigned to several buckets (hash values). That increases the number of useless comparisons, but at the same time it gives us a much better chance of finding a successful match.
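
Here is a sketch of this bucketing idea, reusing names1 and names2 from the naive sketch above. The three hash functions are arbitrary hand-picked examples; real LSH families are constructed more carefully.

// A small family of hash functions; negative bucket ids are fine, they are still consistent
val hashFunctions: Seq[String => Int] = Seq(
  s => s.toLowerCase.hashCode % 1000,
  s => s.toLowerCase.reverse.hashCode % 1000,
  s => s.toLowerCase.filterNot(_ == ' ').hashCode % 1000)

// Assign each name to one bucket per hash function: key = (function index, hash value)
def toBuckets(names: org.apache.spark.rdd.RDD[String]) =
  names.flatMap(n => hashFunctions.zipWithIndex.map { case (f, i) => ((i, f(n)), n) })

// Candidate pairs: only names from both datasets that share at least one bucket
val candidates = toBuckets(names1).join(toBuckets(names2)).values.distinct()
candidates.collect().foreach(println)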

However, this still wouldn't work well enough, because names may contain misprints, or a profile may use special letters in the social network while the payment system allows only traditional Latin characters, or vice versa. n-grams and minhashing come in handy in this situation. The main idea is to take all possible n-grams of a string and apply the minhashing algorithm to them. As a result, we aim to get a set of new hash codes based on the n-grams and compare only strings that were placed into the same buckets by these hash codes.
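
Below is a minimal sketch of both ingredients: character n-grams and a minhash signature built from a family of random linear hash functions h(x) = (a·x + b) mod p. The signature length, the prime, and the random seed are arbitrary choices for the illustration.

import scala.util.Random

// All character n-grams of a string, e.g. ngrams("tom", 2) == Set("to", "om")
def ngrams(s: String, n: Int): Set[String] =
  s.toLowerCase.sliding(n).toSet

// Random coefficients for 20 hash functions of the form (a * x + b) mod prime
val prime = 2147483647L // a large prime, an arbitrary choice for this sketch
val rnd = new Random(42)
val coeffs = Seq.fill(20)((1 + rnd.nextInt(1000000), rnd.nextInt(1000000)))

// Minhash signature: for each hash function, the minimum hash value over all 2-grams
def minhash(s: String): Seq[Long] =
  coeffs.map { case (a, b) =>
    ngrams(s, 2).map(g => (a.toLong * (g.hashCode & 0x7fffffffL) + b) % prime).min
  }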

The step-by-step algorithm is as follows (a sketch tying the steps together is given after the list):

  1. Define a collection of hash functions
  2. Calculate the minhash function on the n-grams of a profile using the minhash algorithm
  3. Based on equal hash codes, get pairs of similar profiles from the social and payment systems
  4. Calculate the Hamming distance within pairs to select the most similar match for each case
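
Here is a sketch that ties these four steps together on Spark, reusing the minhash and hammingDistance helpers and the names1/names2 RDDs from the sketches above. Names that agree in at least one position of the signature become candidate pairs.

// Step 2: key each name by every (position, minhash value) pair of its signature
def byMinhash(names: org.apache.spark.rdd.RDD[String]) =
  names.flatMap(n => minhash(n).zipWithIndex.map { case (h, i) => ((i, h), n) })

// Step 3: equal hash codes in the same signature position produce candidate pairs
val pairs = byMinhash(names1).join(byMinhash(names2)).values.distinct()

// Step 4: among the candidates, keep the closest match per name by Hamming distance
val matched = pairs
  .map { case (v, a) => (v, (a, hammingDistance(v, a))) }
  .reduceByKey((p, q) => if (p._2 <= q._2) p else q)

matched.collect().foreach(println)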

In the next part: a source code example and an implementation on Apache Spark