The Titanic dataset on Kaggle (http://www.kaggle.com/c/titanic-gettingStarted) consists of ~1300 records, one per passenger of the Titanic. The data is split in two: you are given ~900 records that say whether or not each passenger survived the disaster (the training set), and your job is to predict whether or not the remaining ~400 passengers survived, based on a set of factors. You are given their age, name, sex, ticket #, class (first, second, third), and a handful of other variables. At first glance, you will see over 1000 teams and individual data scientists, all using the most sophisticated algorithms that R (and scikit-learn in Python) has to offer. Random forests, support vector machines, and gradient boosting are all popular choices. In addition, people post their code on the forums to show exactly how to achieve 70%, 74%, and even 79% accuracy. Since anyone can copy those solutions, getting past an average score is incredibly difficult.
My first attempt (a random forest) landed me at 72% accuracy, which put me at the bottom of the barrel. I knew I was doing something wrong, since random forests are incredibly powerful.
Tree models like these are pretty simple to build in R. Just import your data, load the package, and run the algorithm on your target variable, like so (shown here with rpart, which fits a single decision tree):
titanic <- read.csv("train.csv") # loads your data
library(rpart) # loads the rpart package, which fits decision trees
# general form: fit <- rpart(Target ~ variable1 + variable2 + variable3, data = titanic, ...)
fit <- rpart(Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked, method = "class", data = titanic) # an actual example
I figured tinkering with the parameters of the algorithms wouldn’t get me anywhere, so I switched gears. I noticed that a large portion of the [age] variable was missing, so I decided to use an algorithm (the Amelia package, named after the missing Amelia Earhart) to impute the missing values and give myself more data, then build my random forest off that. It turns out that making guesses based off a bunch of other guesses isn’t a great idea, so I did even worse. FYI: you should only impute missing values when you are missing a small amount of data.
The [name] variable in the dataset is quite interesting, as it contains titles such as Mr. | Mrs. | Miss. | Master. | Dr. | Major. | etc. It turns out that using the title from the name is actually better than using someone’s age to predict whether or not they will die (for this dataset at least).
Initially I ignored the ticket numbers because I assumed I could not do anything useful with unique character values. After thinking about it for a while, I realized ships are probably broken up into sections (duh), which led me to keep only the first character of the string as a new variable. For example, if the ticket # was P2938, I just called it “P”. People in certain sections of the ship had a higher probability of dying.
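Here is a minimal sketch of the title and ticket-prefix features in base R. It assumes names follow the "Surname, Title. Given names" pattern. The rows and the column names Title and TicketPrefix are made up for illustration, so treat it as one way to do this rather than the exact code behind the scores above.

```r
# Toy rows shaped like the Kaggle columns (all values are made up).
passengers <- data.frame(
  Name = c("Smith, Mr. John", "Jones, Mrs. Anna (Anna Lee)",
           "Brown, Miss. Ella", "Hall, Master. Tom"),
  Ticket = c("P2938", "A/5 21171", "113803", "PC 17599"),
  stringsAsFactors = FALSE
)

# Title: the text between the comma and the first period, e.g. "Mr", "Mrs".
passengers$Title <- sub("^.*, *([^.]+)\\..*$", "\\1", passengers$Name)

# Ticket prefix: the first character of the ticket string, e.g. "P2938" -> "P".
passengers$TicketPrefix <- substr(passengers$Ticket, 1, 1)

print(passengers[, c("Title", "TicketPrefix")])
```

Both new columns would typically be converted with factor() before being passed to a tree-based model.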
Visualizing a basic decision tree (part of your random forest) can help you understand underlying relationships in the data.
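A rough sketch of how a single tree can be drawn with rpart. To keep it self-contained, it uses R's built-in Titanic table (class, sex, age group, survival) as a stand-in for the Kaggle train.csv, so the variables differ from the competition data.

```r
library(rpart)  # rpart ships with standard R installations

# Stand-in data: the built-in Titanic contingency table, expanded to one row
# per person. This is NOT the Kaggle file; it only has Class, Sex, Age, Survived.
counts <- as.data.frame(Titanic)
people <- counts[rep(seq_len(nrow(counts)), counts$Freq),
                 c("Class", "Sex", "Age", "Survived")]

# Fit one classification tree and draw it.
fit <- rpart(Survived ~ Class + Sex + Age, data = people, method = "class")
plot(fit, uniform = TRUE, margin = 0.1)  # draws the tree skeleton
text(fit, use.n = TRUE, cex = 0.8)       # adds split labels and class counts
print(fit)                               # the same tree as plain text
```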
Always check how your variables are performing in the random forest to gauge usefulness.
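One possible way to run that check, assuming the randomForest package is installed. The data is again R's built-in Titanic table rather than the Kaggle file, and the seed and ntree value are arbitrary choices.

```r
library(randomForest)  # assumed installed: install.packages("randomForest")

# Stand-in data: built-in Titanic table expanded to one row per person.
counts <- as.data.frame(Titanic)
people <- counts[rep(seq_len(nrow(counts)), counts$Freq),
                 c("Class", "Sex", "Age", "Survived")]

set.seed(42)  # arbitrary seed so repeated runs match
rf <- randomForest(Survived ~ Class + Sex + Age, data = people,
                   ntree = 200, importance = TRUE)

importance(rf)  # per-variable decrease in accuracy and in Gini impurity
varImpPlot(rf)  # the same numbers as a dot chart
```

Variables that sit near zero on both measures are candidates to drop or re-engineer.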
Check the error in your boosting to determine the proper number of iterations to run.
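A sketch of that check with the gbm package (assumed installed). It uses cross-validated error to pick the iteration count. The tuning values are arbitrary, and the data is the built-in Titanic table standing in for the Kaggle file.

```r
library(gbm)  # assumed installed: install.packages("gbm")

# Stand-in data: built-in Titanic table expanded to one row per person.
counts <- as.data.frame(Titanic)
people <- counts[rep(seq_len(nrow(counts)), counts$Freq),
                 c("Class", "Sex", "Age", "Survived")]
# The bernoulli loss in gbm expects a 0/1 response.
people$Survived01 <- as.integer(people$Survived == "Yes")

set.seed(42)  # arbitrary seed
boost <- gbm(Survived01 ~ Class + Sex + Age, data = people,
             distribution = "bernoulli", n.trees = 500,
             interaction.depth = 2, shrinkage = 0.05,
             cv.folds = 5, n.cores = 1)

# Plots training and cross-validation error by iteration and returns the
# iteration with the lowest cross-validation error.
best_iter <- gbm.perf(boost, method = "cv")
best_iter
```

If best_iter lands at or very near n.trees, the model may need more iterations than it was given.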
Altering the data I was given to make it more useful is what caused me to jump from rank 950 to rank ~100. Then I decided to try different algorithms. Random forests are generally the “best” out there, but I tried gradient boosting and support vector machines as well. I found the algorithms mostly predicted the same people would die, with a few exceptions. My final model was a vote: I took the mode of the 3 models as my final decision as to whether or not someone would survive, and this led me to the top 5% of the competitors.
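The vote itself takes very little code. Below is a minimal sketch with made-up 0/1 predictions; the vector names are hypothetical placeholders for whatever each model's predict() call returns.

```r
# Hypothetical 0/1 predictions (1 = survived) from three models
# for the same five passengers; the values are made up.
pred_rf  <- c(1, 0, 0, 1, 1)
pred_gbm <- c(1, 0, 1, 1, 0)
pred_svm <- c(0, 0, 1, 1, 1)

votes <- cbind(pred_rf, pred_gbm, pred_svm)

# With three binary votes, the mode is the class that gets at least two votes.
final <- as.integer(rowSums(votes) >= 2)
final  # 1 0 1 1 1
```

Using an odd number of models means a two-class vote can never tie.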