In other words, you want to find all items in your sales that are sold together; for example, people usually buy chips with beer. There are several algorithms for this. One of them is the Apriori algorithm, which is available in R through the 'arules' package.
To see how it works, let's first generate a dataset with the following Groovy script:
class Constants {
    // 'Orange' and 'Milk' are listed twice on purpose, so they are drawn twice as often
    static final goods = [
        'Apple', 'Orange', 'Pineaple', 'Cherry', 'Beef', 'Sugar', 'Milk',
        'Pear', 'Limon', 'Blubbery', 'IceCream', 'Cake', 'Orange', 'Milk',
        'Cola', 'Bread', 'Coffe', 'Cookies', 'Beer', 'Tea', 'Salmon', 'Steal Water', 'Nuts']
}

rand = new Random()

// pick one product uniformly at random from the list
def createProduct() {
    return Constants.goods[ rand.nextInt( Constants.goods.size() ) ]
}

// build one transaction with itemsNum distinct products
def createTransaction(int itemsNum) {
    List<String> list = new ArrayList<String>();
    for(int i = 0; i < itemsNum; i++) {
        String newProduct = createProduct();
        if(list.contains(newProduct)) {
            i--;    // duplicate: retry this position
        }else{
            list.add(newProduct);
        }
    }
    // create a relation between Steal Water and Nuts (added in 9 cases out of 10)
    if(list.contains('Nuts') && !list.contains('Steal Water') && (rand.nextInt(10)+1) % 7 != 0) {
        list.add('Steal Water')
    }
    return list.join(',')
}

// print recordsNum transactions of 4 to 8 items each, one per line
def createDataSet(int recordsNum) {
    for(int i = 0; i < recordsNum; i++){
        System.out.println( createTransaction(rand.nextInt(5)+4) )
    }
}

createDataSet(250)
As a result, you will get a set of transactions where each line represents one transaction and contains a list of items, for example:
Orange,Pineaple,Steal Water,Milk,Blubbery
As you can see from the code (the if statement after the loop in createTransaction), whenever a transaction contains Nuts but not Steal Water, we add Steal Water with a probability of 90%: a random number from 1 to 10 passes the check unless it is 7. This means that at least 90% of the transactions in the output that contain Nuts will also contain Steal Water. Otherwise, items are drawn from a uniform distribution, with Orange and Milk doubled (see the initial list of products). So, as a result of the basket analysis, we expect to see the Nuts - Steal Water pair. This can be done with the following R script:
# Install the R package arules if required
# install.packages("arules")

# Load the arules package
library("arules")

# Read the transaction file as an object of class transactions
txn <- read.transactions(file = "d:\\work\\R\\items.csv", rm.duplicates = FALSE, format = "basket", sep = ",")

# Visualize the item frequencies in the transaction file
itemFrequencyPlot(txn)

# Run the Apriori algorithm
basket_rules <- apriori(txn, parameter = list(support = 0.2, confidence = 0.9, target = "rules"))

# Check the generated rules using inspect
inspect(basket_rules)
The read.transactions parameters used here are:
file – the transaction file (csv/txt)
format – single/basket. For ‘basket’ format, each line in the transaction data file represents a transaction where the items (item labels) are separated by the characters specified by sep. For ‘single’ format, each line corresponds to a single item, containing at least ids for the transaction and the item.
rm.duplicates – TRUE/FALSE, whether duplicate items within a transaction should be removed
sep – the separator between items, ',' for csv
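To make the two formats more concrete, here is a minimal base R sketch that does not use arules at all. The sample lines and the names lines, baskets, single and by_tid are made up for illustration; this only mimics the idea of the 'basket' and 'single' layouts and is not how read.transactions is implemented.

```r
# Hypothetical sample in 'basket' format: one transaction per line.
lines <- c(
  "Orange,Pineaple,Steal Water,Milk,Blubbery",
  "Nuts,Steal Water,Beer",
  "Milk,Orange,Milk"            # 'Milk' is repeated on purpose
)

# Split every line on the separator: a list of character vectors.
baskets <- strsplit(lines, ",", fixed = TRUE)

# Drop repeated items inside each transaction (the idea behind rm.duplicates).
baskets_unique <- lapply(baskets, unique)

# Share of transactions that contain each item.
item_freq <- sort(table(unlist(baskets_unique)) / length(baskets_unique),
                  decreasing = TRUE)
print(item_freq)

# Hypothetical sample in 'single' format: one (transaction id, item) pair per row.
single <- data.frame(tid  = c(1, 1, 2),
                     item = c("Nuts", "Beer", "Nuts"),
                     stringsAsFactors = FALSE)

# Group the items by transaction id to get the same list-of-baskets shape.
by_tid <- split(single$item, single$tid)
print(by_tid)
```

Both layouts end up as one set of items per transaction, which is the shape the rule mining works on.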
The visualization step (the itemFrequencyPlot call) should produce a plot of item frequencies.
In it, Orange and Milk are sold much more frequently than any other product, as expected.
The apriori call runs the analysis. We ask it to find all rules whose items appear together in 20% or more of the transactions (support) and whose right-hand side appears in at least 90% of the transactions that contain the left-hand side (confidence). We aim to find association rules; however, there are several other possible targets described in the documentation. The output of the inspect call is:
  lhs       rhs            support  confidence  lift
1 {Nuts} => {Steal Water}  0.236    0.9365079   2.364919
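The above rule means “If Nuts is bought, then there is a 93% likelihood that Steal Water is bought too”. The support of 0.236 indicates that 23.6% of the transactions in the data contain both Nuts and Steal Water. The confidence of 93% indicates that, of the transactions that contain Nuts, 93% also contain Steal Water. The third parameter, lift, indicates the quality of the rule: it is the confidence divided by the overall share of transactions that contain Steal Water. A lift of 1 means the two items are independent, and a lift below 1 means they appear together less often than chance would suggest, so such a rule is misleading. [read good explanation] So, take into consideration only rules with a lift greater than 1. In our case the lift is about 2.36, which means Steal Water is bought about 2.4 times more often with Nuts than in general, so it is a very good rule and we can believe in it.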
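The three numbers can be reproduced by hand. The sketch below is plain base R and does not use arules. The helper rule_metrics and the toy data are hypothetical, and the counts (59 transactions with both items, 63 with Nuts, 99 with Steal Water, out of 250) are not stated in the output above; they are back-calculated from the reported support, confidence and lift, so treat them as an assumption.

```r
# Hypothetical helper: support, confidence and lift of the rule lhs => rhs,
# computed over a list of character vectors (one vector per transaction).
rule_metrics <- function(baskets, lhs, rhs) {
  has_lhs <- vapply(baskets, function(b) all(lhs %in% b), logical(1))
  has_rhs <- vapply(baskets, function(b) all(rhs %in% b), logical(1))
  both    <- has_lhs & has_rhs

  support    <- mean(both)                 # share of transactions with lhs and rhs
  confidence <- sum(both) / sum(has_lhs)   # share of lhs transactions that also have rhs
  lift       <- confidence / mean(has_rhs) # confidence relative to the base rate of rhs
  c(support = support, confidence = confidence, lift = lift)
}

# Toy data with the back-calculated counts (assumption, see the text above):
# 59 with both items, 4 with Nuts only, 40 with Steal Water only, 147 with neither.
baskets <- c(
  rep(list(c("Nuts", "Steal Water")), 59),
  rep(list(c("Nuts", "Beer")), 4),
  rep(list(c("Steal Water", "Tea")), 40),
  rep(list(c("Milk", "Orange")), 147)
)

m <- rule_metrics(baskets, lhs = "Nuts", rhs = "Steal Water")
print(round(m, 7))
```

With these counts the support is 59/250, the confidence is 59/63 and the lift is (59/63) / (99/250), which agrees with the inspect output to the printed precision.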
So, what about performance? To test it, I created a 200 MB data set that contains 1,000,000 transactions over about 10,000 items; each transaction contains from 5 to 35 items (uniform distribution). Several rules were created manually in this data set. The result of calling
system.time( basket_rules ... )
is about 10 s on my desktop. So, it's much faster than I expected.
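For readers who have not used system.time before, here is a small base R sketch of the timing pattern. The workload is a hypothetical stand-in (counting item frequencies over synthetic transactions of 5 to 35 items), much smaller than the data set described above, and it is not the apriori benchmark itself; the names n_items, n_txn, baskets and item_counts are made up for illustration.

```r
set.seed(42)       # make the synthetic data reproducible
n_items <- 1000    # hypothetical catalogue size (smaller than in the post)
n_txn   <- 10000   # hypothetical number of transactions

# Each transaction holds 5 to 35 distinct item ids, drawn uniformly.
baskets <- lapply(seq_len(n_txn), function(i) {
  sample.int(n_items, size = sample(5:35, 1))
})

# system.time evaluates the expression and reports user, system and elapsed time.
timing <- system.time({
  item_counts <- tabulate(unlist(baskets), nbins = n_items)
})

print(timing)
cat("elapsed seconds:", timing[["elapsed"]], "\n")
```

The same wrapper works around any expression, so the rule mining call can be timed in the same way.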