Thursday, 25 July 2013

Market basket analysis with R

Affinity analysis is a data analysis and data mining technique that discovers co-occurrence relationships <...>. In retail, affinity analysis is used to perform market basket analysis, in which retailers seek to understand the purchase behavior of customers. This information can then be used for purposes of cross-selling and up-selling, in addition to influencing sales promotions, loyalty programs, store design, and discount plans. [from Wikipedia]
In other words, you want to find all items from your sales that are sold together; for example, people usually buy chips with beer. There are several algorithms for this, and one of them is the Apriori algorithm, which is available in R through the 'arules' package.

To see how it works, let's first generate a dataset (with the following Groovy script):
As a result, you get a set of transactions in which each line represents one transaction and contains a list of items, for example:
Orange,Pineaple,Steal Water,Milk,Blubbery
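
For readers who want to reproduce a similar file, here is a minimal Python sketch of such a generator. It is a stand-in, not the Groovy script from this post: the product list beyond the items named in the post, the basket size range, the seed, and the function names are all assumptions. It draws items uniformly from a list in which Orange and Milk appear twice, and adds Steal Water to a basket containing Nuts with probability 0.9.

```python
import random

# Hypothetical product list (labels spelled as in the sample line above).
# Orange and Milk appear twice, so a uniform draw picks them about twice
# as often as any other item.
PRODUCTS = [
    "Orange", "Orange", "Milk", "Milk", "Pineaple", "Steal Water",
    "Blubbery", "Nuts", "Chips", "Beer", "Bread", "Cheese",
]


def make_transaction(rng, min_items=2, max_items=5, link_prob=0.9):
    """Draw one basket of distinct items, then apply the Nuts link."""
    size = rng.randint(min_items, max_items)
    basket = []
    while len(basket) < size:
        item = rng.choice(PRODUCTS)
        if item not in basket:
            basket.append(item)
    # Assumed link: Nuts pulls Steal Water into the basket 90% of the time.
    if "Nuts" in basket and "Steal Water" not in basket:
        if rng.random() < link_prob:
            basket.append("Steal Water")
    return basket


def generate(n, seed=42):
    """Return n baskets; the fixed seed makes the output repeatable."""
    rng = random.Random(seed)
    return [make_transaction(rng) for _ in range(n)]


if __name__ == "__main__":
    # Print a few baskets in the comma-separated layout shown above.
    for basket in generate(5):
        print(",".join(basket))
```

Because Steal Water can also land in a basket by the ordinary uniform draw, the share of Nuts baskets that contain it ends up somewhat above the link probability itself.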
As you can see from the code (line 27), we create a link between Nuts and Steal Water with a probability of around 90%, which means that in the output more than 90% of the transactions that contain Nuts will also contain Steal Water. Transactions are generated from a uniform distribution over the product list, in which Orange and Milk appear twice (see the initial list of products).

So, as the result of the basket analysis we expect to see the Nuts - Steal Water pair. It can be done with the following R script:

Pay attention to line 7: the method there (read.transactions in 'arules') reads the list of transactions from the file and can be configured with the following options:
file – csv/txt
format – single/basket. For ‘basket’ format, each line in the transaction data file represents a transaction where the items (item labels) are separated by the characters specified by sep. For ‘single’ format, each line corresponds to a single item, containing at least ids for the transaction and the item.
rm.duplicates – TRUE/FALSE
sep – separator for csv
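
To make the difference between the 'basket' and 'single' layouts concrete, here is a small Python sketch that parses the same two transactions from both layouts. It is not 'arules' code; the helper names and the sample lines are hypothetical, and the 'single' layout is assumed to be "transaction id, item" on each line.

```python
from collections import defaultdict


def parse_basket(lines, sep=","):
    """'basket' layout: one transaction per line, items separated by sep."""
    return [
        [item.strip() for item in line.split(sep) if item.strip()]
        for line in lines
        if line.strip()
    ]


def parse_single(lines, sep=","):
    """'single' layout: one (transaction id, item) pair per line."""
    grouped = defaultdict(list)  # keeps ids in order of first appearance
    for line in lines:
        if not line.strip():
            continue
        tid, item = (part.strip() for part in line.split(sep, 1))
        grouped[tid].append(item)
    return list(grouped.values())


basket_lines = ["Nuts,Steal Water,Milk", "Orange,Milk"]
single_lines = ["1,Nuts", "1,Steal Water", "1,Milk", "2,Orange", "2,Milk"]

# Both layouts describe the same two transactions.
assert parse_basket(basket_lines) == parse_single(single_lines)
```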

As a result of the visualization (line 10), you should get something similar to:

As you can see, Orange and Milk were sold much more frequently than any other product, as expected.
In line 13 we run the analysis and ask it to find "all item sets that were sold in 20% or more of the transactions and, in more than 90% of cases, were sold together", that is, a minimum support of 0.2 and a minimum confidence of 0.9. We aim to find association rules; however, there are several other possible targets described in the documentation. The output from line 15 is:
  lhs       rhs           support confidence lift
1 {Nuts} => {Steal Water} 0.236   0.9365079  2.364919
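The above rule means “If Nuts is bought, then there is a 93.7% likelihood that Steal Water is bought too”. The support of 0.236 indicates that 23.6% of the transactions in the data contain both Nuts and Steal Water. The confidence of 93.7% indicates that, out of the transactions that involve Nuts, 93.7% also involve Steal Water. The third parameter, lift, indicates the quality of the rule: it compares the confidence with how often Steal Water is bought overall. In general, a lift of less than 1 means that the items occur together less often than they would if they were independent, so the rule is misleading, and a lift of 1 means there is no association at all. [read good explanation] So, take into consideration only rules with a lift greater than 1. In our case the lift is about 2.36, which means Steal Water is about 2.36 times more likely to be in a basket that contains Nuts than in an average basket, so it's a very good rule and we can believe in it.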
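
The three numbers can be checked by hand. The Python sketch below computes them from their definitions on a toy list of baskets. The counts (1,000 transactions, 252 with Nuts, 396 with Steal Water, 236 with both) are an assumption: they are reconstructed so that they agree with the reported support, confidence and lift, and they are not the actual data set. The function name is hypothetical.

```python
def rule_metrics(transactions, lhs, rhs):
    """Return (support, confidence, lift) for the rule lhs => rhs."""
    n = len(transactions)
    n_lhs = sum(1 for t in transactions if lhs <= t)
    n_rhs = sum(1 for t in transactions if rhs <= t)
    n_both = sum(1 for t in transactions if (lhs | rhs) <= t)
    support = n_both / n           # share of baskets with lhs and rhs
    confidence = n_both / n_lhs    # share of lhs baskets that also have rhs
    lift = confidence / (n_rhs / n)  # confidence relative to rhs base rate
    return support, confidence, lift


# Assumed counts, chosen to be consistent with the output above.
transactions = (
    [{"Nuts", "Steal Water"}] * 236   # both items
    + [{"Nuts"}] * 16                 # Nuts without Steal Water
    + [{"Steal Water"}] * 160         # Steal Water without Nuts
    + [{"Milk"}] * 588                # neither item
)

support, confidence, lift = rule_metrics(
    transactions, {"Nuts"}, {"Steal Water"}
)
print(round(support, 3), round(confidence, 7), round(lift, 6))
```

Written this way, it is easy to see that lift is just the confidence divided by the overall share of baskets containing the right-hand side.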

So, what about performance? To test it, I created a 200 MB data set that contains 1,000,000 transactions with about 10,000 items; each transaction contains from 5 to 35 items (uniform distribution). Several rules were created manually in this data set. The result of calling
system.time( basket_rules ... )
is about 10 s on my desktop. So, it's much faster than I expected.
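
For a rough idea of how a timing like this can be taken outside R, here is a scaled-down Python sketch. It is not 'arules' and says nothing about its speed: the data set size, the item names, the 30% planted pair and the function name are all assumptions. It times only the first two Apriori levels: it counts single items, keeps the frequent ones, and then counts pairs built from those frequent items only.

```python
import random
import time
from collections import Counter
from itertools import combinations


def frequent_pairs(transactions, min_support):
    """First two Apriori levels: frequent items, then pairs of them."""
    n = len(transactions)
    item_counts = Counter(item for t in transactions for item in t)
    frequent = {i for i, c in item_counts.items() if c / n >= min_support}
    pair_counts = Counter()
    for t in transactions:
        # Pruning step: only frequent items can form a frequent pair.
        kept = sorted(i for i in t if i in frequent)
        pair_counts.update(combinations(kept, 2))
    return {p: c / n for p, c in pair_counts.items() if c / n >= min_support}


# Assumed, scaled-down data: 20,000 baskets of 5 to 35 items drawn from
# 1,000 items, with one pair planted in about 30% of the baskets.
rng = random.Random(0)
items = [f"item{i}" for i in range(1000)]
transactions = []
for _ in range(20000):
    basket = set(rng.sample(items, rng.randint(5, 35)))
    if rng.random() < 0.3:
        basket |= {"Nuts", "Steal Water"}
    transactions.append(basket)

start = time.perf_counter()
pairs = frequent_pairs(transactions, min_support=0.2)
elapsed = time.perf_counter() - start
print(f"{len(pairs)} frequent pair(s) found in {elapsed:.2f} s")
```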
