Thursday, January 16, 2014

Predicting the Age of Abalone from Physical Measurements

The Abalone dataset has been freely available from the UCI Machine Learning Repository since 1995. It contains the results of abalone research in Australia, and the task is to predict the age of abalone from physical measurements. The age of abalone is determined by cutting the shell through the cone, staining it, and counting the number of rings through a microscope - a boring and time-consuming task. Other measurements, which are easier to obtain, are used to predict the age. In real conditions the task is more complex, and further information, such as weather patterns and location (hence food availability), may be required to solve the problem.

So, Age ~ Rings, and it must be predicted from a set of measurements such as Diameter, Weight, Height, Length, etc. This is a supervised learning task, because the dataset provides the Result~Features relation. A simple check shows ring counts from 1 to 29, which is a large number of classes for classification. Linear regression is another supervised learning technique, and it treats the ring count as a numeric value instead.
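The range check mentioned above might look like the following minimal sketch. The `rings` vector is a small made-up sample, not the real data, and `summarise_rings` is a hypothetical helper; with the dataset loaded, `abalone$Rings` would take the place of `rings`.

```r
# Hypothetical helper: summarise the spread of the target variable
summarise_rings <- function(rings) {
  list(
    min = min(rings),                   # smallest ring count
    max = max(rings),                   # largest ring count
    n_classes = length(unique(rings))   # distinct values a classifier would face
  )
}

# Made-up sample for illustration only (assumption, not the real dataset)
rings <- c(1, 7, 9, 10, 10, 11, 15, 29)
rings_summary <- summarise_rings(rings)
print(rings_summary)
```

The more distinct values the target has, the less attractive it is to treat each one as a separate class.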

EDA (exploratory data analysis) is the first step before building any model. Here is the code for loading the dataset into memory and plotting several relations, for example Rings~Diameter:

library(ggplot2)
 
# read dataset from local file
abalone <- read.csv("/Users/kostya/Downloads/abalone.data.csv", header = FALSE)
 
# set names for dataframe columns
colnames(abalone) <- c('Sex', 'Length', 'Diameter', 'Height', 'WholeWeight', 'ShuckedWeight',
                    'VisceraWeight', 'ShellWeight', 'Rings')
 
# plot a density histogram of the ring counts
hist(abalone$Rings, freq = FALSE)
 
# draw all sexes on one plot, with a linear fit per sex
ggplot(abalone, aes(Diameter, Rings, color = Sex)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)


This plot (like other relations such as Rings~WholeWeight) shows quite clearly that the relation differs for each sex, so the first thought is to fit a separate regression for each sex or to use Sex as a factor.
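The second option, using Sex as a factor, can be sketched with an interaction term so that each sex gets its own intercept and slope inside a single model. The data frame `toy` below is synthetic and its slopes are invented purely for illustration; with the real data the same formula would be fitted on `abalone`.

```r
set.seed(1)
n <- 300

# Synthetic stand-in for the dataset (assumption: made-up slopes per sex)
toy <- data.frame(
  Sex = factor(sample(c("F", "I", "M"), n, replace = TRUE)),
  Diameter = runif(n, 0.1, 0.6)
)
true_slope <- c(F = 10, I = 20, M = 12)
toy$Rings <- 5 + unname(true_slope[as.character(toy$Sex)]) * toy$Diameter +
  rnorm(n, sd = 1)

# Diameter * Sex expands to Diameter + Sex + Diameter:Sex,
# i.e. a separate intercept and slope for each level of Sex
fit <- lm(Rings ~ Diameter * Sex, data = toy)
print(coef(fit))
```

The `Diameter:SexI` and `Diameter:SexM` coefficients are the slope differences relative to the baseline level `F`. Unlike three separate models, this form forces all sexes to share one formula and one residual variance.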

To go on with separate regression models, we need to construct a formula for each sex by investigating each relation. For example, here is the Rings~VisceraWeight relation (the same plot can be drawn for WholeWeight and the other weights):

# plot each sex in its own panel
ggplot(abalone, aes(VisceraWeight, Rings)) + 
  geom_jitter(alpha=0.25) + 
  geom_smooth(method=lm, se=FALSE) +
  facet_grid(. ~ Sex)


The relations for Male and Infant clearly have a logarithmic trend, so it is logical to add log() terms to the formulas for those sexes.
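One way to check a visual impression like this is to fit the same predictor with and without `log()` and compare the fits. The sketch below does that on synthetic data generated with a known logarithmic relation; the variable names and numbers are invented for illustration and are not taken from the abalone dataset.

```r
set.seed(42)
n <- 200

# Synthetic data with a logarithmic relation (assumption: made-up parameters)
weight <- runif(n, 0.01, 1)
rings <- 10 + 3 * log(weight) + rnorm(n, sd = 0.5)
toy <- data.frame(weight = weight, rings = rings)

fit_linear <- lm(rings ~ weight, data = toy)       # straight-line fit
fit_log    <- lm(rings ~ log(weight), data = toy)  # logarithmic fit

# Compare the share of variance explained by each model
r2_linear <- summary(fit_linear)$r.squared
r2_log    <- summary(fit_log)$r.squared
print(c(linear = r2_linear, log = r2_log))
```

On real data the difference is usually smaller than in this constructed case, so a residual plot is a useful second check.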


summary(lm(Rings~Length+I(Diameter^2)+log(WholeWeight)+log(ShellWeight)+log(ShuckedWeight)
           +Height+VisceraWeight, data=subset(abalone, Sex %in% 'I'))  )
 
summary(lm(Rings~Length+I(Diameter^2)+log(WholeWeight)+log(ShellWeight)+ShuckedWeight
           +Height+VisceraWeight, data=subset(abalone, Sex %in% 'M'))  )
 
summary(lm(Rings~Length+I(Diameter^2)+WholeWeight+ShellWeight+ShuckedWeight
           +Height+VisceraWeight, data=subset(abalone, Sex %in% 'F'))  )


As a result, the following formula can be constructed from the linear regression coefficients to predict the number of rings for Infant:
Rings = 8.5398 - 7.6755*Length + 8.7707*Diameter^2 + 1.4837*log(WholeWeight) + 2.0745*log(ShellWeight) - 2.3415*log(ShuckedWeight) + 27.8275*Height + 5.9972*VisceraWeight

As mentioned in the task description, Age = Rings + 1.5 (in years).
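The Infant formula and the age conversion can be written as two small functions. This is a sketch that copies the coefficients from the formula in this post; the function names `predict_infant_rings` and `rings_to_age` are hypothetical, and the measurements in the usage lines are made up.

```r
# Coefficients copied from the Infant formula in this post
infant_coef <- c(
  intercept = 8.5398, length = -7.6755, diameter_sq = 8.7707,
  log_whole = 1.4837, log_shell = 2.0745, log_shucked = -2.3415,
  height = 27.8275, viscera = 5.9972
)

# Hypothetical helper: predicted number of rings for an Infant abalone
predict_infant_rings <- function(Length, Diameter, WholeWeight, ShellWeight,
                                 ShuckedWeight, Height, VisceraWeight) {
  unname(
    infant_coef["intercept"] +
      infant_coef["length"] * Length +
      infant_coef["diameter_sq"] * Diameter^2 +
      infant_coef["log_whole"] * log(WholeWeight) +
      infant_coef["log_shell"] * log(ShellWeight) +
      infant_coef["log_shucked"] * log(ShuckedWeight) +
      infant_coef["height"] * Height +
      infant_coef["viscera"] * VisceraWeight
  )
}

# Age = Rings + 1.5, as stated in the task description
rings_to_age <- function(rings) rings + 1.5

# Usage with made-up measurements (assumption, not a row from the dataset)
rings_hat <- predict_infant_rings(
  Length = 0.4, Diameter = 0.3, WholeWeight = 0.3, ShellWeight = 0.1,
  ShuckedWeight = 0.12, Height = 0.1, VisceraWeight = 0.06
)
print(rings_to_age(rings_hat))
```

When the fitted `lm` object is still available, calling `predict()` on it with a `newdata` data frame gives the same kind of prediction without copying coefficients by hand.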