An eccentric billionaire recently decided it would be a brilliant idea to recreate the least successful cruise liner to ever exist. I disagree, but let’s continue.
So you’re thinking about taking the inaugural journey on the Titanic II but are deathly afraid of icebergs? If Titanic Jr. lives up to its claims of authenticity, lack of lifeboats and all, we can create some models to predict survivability.
With the power of data, let’s discover how to survive the Titanic so you can continue on making terrible life decisions.
## Titanic Manifest

Roughly 2,200 passengers and crew: about 1,500 of whom died and, for you math majors, about 700 survived.
Each passenger has an age, gender, socio-economic class, ticket price, port of embarkation, number of siblings/spouses aboard, number of parents/children aboard, and a name (but that’s a worthless variable).
## Predicting Survival

How do we go about creating prediction models? We’ll start off ‘manually’ analyzing the data to determine survival probabilities. Afterwards, we’ll try our luck with some machine learning algorithms.
When I say ‘manually’ analyze data, of course, I mean with code. Specifically, we’ll be using Python. Nerdier types might be questioning why I’m not using a language made for data analysis, like R. My answer is simply: numpy and pandas.
Within code snippets, I’ll refer to pandas as `pd` and numpy as `np`.
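Here’s a minimal setup sketch, assuming the standard Kaggle Titanic `train.csv` and its column names:

```python
import numpy as np
import pandas as pd

# Load the manifest; 'train.csv' is the standard Kaggle Titanic training file.
train = pd.read_csv('train.csv')

# Columns include: Survived, Pclass, Name, Sex, Age, SibSp, Parch, Fare, Embarked.
print(train.columns.values)
```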
## Basic Intuition

Okay, we’ve got our tools, but where do we start? Let’s explore our basic intuition: ‘women and children first’. All we need to do is take the number of women who survived and divide it by the number of women on board.
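In pandas that’s nearly a one-liner. A sketch, assuming the `train` DataFrame from above and Kaggle’s `Sex` and `Survived` columns:

```python
# Survived is a 0/1 column, so its mean is exactly the survival proportion.
women = train[train['Sex'] == 'female']
print('Proportion of women who survived:', women['Survived'].mean())
```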
Proportion of women who survived: 0.74203821, which translates to 74% for us humans.
Okay, 74% survival rate among women seems like a promising premise to build off.
Let’s create a crude model that says something along the lines of `if 'male' then 'dead'`. Here’s the idea in Python:
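A minimal sketch, assuming `test.csv` is the standard Kaggle test file:

```python
# Crude model: every male dies, every female lives.
test = pd.read_csv('test.csv')
test['Survived'] = 0
test.loc[test['Sex'] == 'female', 'Survived'] = 1
```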
The results of this simple model are surprisingly sound. Applied to the test data, we get back an accuracy of about 77%. Not bad for basic intuition.
### Sorry Kids

So we’ve seen that being a woman is advantageous, but does age matter?
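One more mask answers that; the cutoff of 18 for ‘adult’ is an assumption on my part:

```python
# Women with a known age of 18 or older; rows with a missing Age silently
# drop out here, a first hint of the data gaps discussed below.
adult_women = train[(train['Sex'] == 'female') & (train['Age'] >= 18)]
print('Proportion of adult women who survived:', adult_women['Survived'].mean())
```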
Proportion of adult women who survived: 0.771029.
Sorry kids. Despite what your parents say, you’re just not important enough to be considered a predictive variable. This is probably a result of gaps in the data, which we’ll discuss later on. For now we’ll just leave the children behind in our model.
### Make Way For The Rich

Let’s try out our second intuition: ‘women and rich people first’. At this point the code is getting quite lengthy, so I’ll just cover the results.
Grouping people by their wealth seems to be a clear indicator of survival. Let’s adjust the model to say anyone with over a 50% probability of survival gets to live. Now it looks something like `if 'male' and 'poor' then 'double dead'`.
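In code, the adjustment might look like this, assuming Kaggle’s `Pclass` column (1 = first class, 3 = third):

```python
# Refined model: women survive unless they're in 3rd class.
test['Survived'] = 0
test.loc[(test['Sex'] == 'female') & (test['Pclass'] != 3), 'Survived'] = 1
```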
Taking out 3rd class females makes our predictions on the test data about 2% more accurate. We can live (so long as you’re a rich female) with 79%, but let’s see if we can improve our accuracy.
### Letting Men Live

Up until now our model hasn’t left any room in the lifeboats for men, but surely a few survived.
Let’s try and make some room with machine learning algorithms.
## Machine Learning

Finally! Let’s learn some machines.
The first part of the ~~mind-numbing frustration~~ fun of machine learning is cleaning up the data. Computers have a hard time inherently understanding strings, so we have to convert everything to an integer. For instance:
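A sketch of that cleanup, assuming Kaggle’s `Sex` and `Embarked` columns; the `Gender` and `Port` column names (and the integer codes themselves) are arbitrary choices of mine:

```python
# Encode gender as an integer: female = 0, male = 1.
train['Gender'] = train['Sex'].map({'female': 0, 'male': 1}).astype(int)

# Encode the boarding port, filling the few missing values with the most
# common port (Southampton) first.
train['Port'] = train['Embarked'].fillna('S').map({'S': 0, 'C': 1, 'Q': 2}).astype(int)
```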
The scikit-learn package for Python has a random forest classifier that does most of the heavy lifting for us. I’ll try to break down the concept so we can get an idea of what’s going on under the hood.
### Decision Trees

A decision tree is a flowchart-esque way of illustrating an algorithm. If you don’t know what flowcharts or algorithms are, you can think of it as a map of possible decisions and their resulting outcomes. Going back to the first model we created, part of the decision tree looked like this:
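In plain text, that branch boils down to:

```
Sex?
├── male   → predict died
└── female → predict survived
```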
Now that we have the computer’s help, we can get more detailed with our probabilities. Subsequently this means letting men live every once in a while. Here’s an example of a small subsection (it gets real complex, real quick) of the new decision tree:
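Something in this spirit, where the split variables and outcomes are purely illustrative rather than the actual fitted tree:

```
Sex?
├── female → Pclass?
│     ├── 1st/2nd → survived (high probability)
│     └── 3rd     → keep splitting on Fare, Age, ...
└── male → Age?
      ├── under 10 → keep splitting on SibSp, Pclass, ...
      └── 10+      → died (high probability)
```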
### Random Forests™

In the beginning we were analyzing all of the training data in order to create our prediction models. Our first iterations used all of the sample passengers (rows) and a couple of variables (columns) at a time.
Random forests approach the problem a bit differently. Instead of analyzing all of the data to create one model, they build many models from random chunks of it. The starting point is the method of bagging, which creates each model from a random set of passengers (rows) and all of the given variables (columns).
Random Forests take bagging a step further by randomly selecting both the samples and the variables used to create each tree. This method is repeated x times (x being the number of trees in your forest). The mode of the forest (the majority vote across all the trees) is the model used for prediction.
Here’s an overarching snippet of code to conceptualize the process:
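A sketch using scikit-learn; the feature list assumes the cleanup columns from earlier were also applied to the test set:

```python
from sklearn.ensemble import RandomForestClassifier

# The integer columns built during cleanup ('Gender' and 'Port' come from
# the encoding snippet above, applied to both train and test).
features = ['Pclass', 'Gender', 'Port']

# Grow a forest of 100 trees; n_estimators is an assumed setting, tune to taste.
forest = RandomForestClassifier(n_estimators=100)
forest.fit(train[features], train['Survived'])

# Every tree votes on each passenger; the majority verdict is the prediction.
predictions = forest.predict(test[features])
```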
Just to be sure we’re all on the same page: the idea of this particular algorithm is to create a bunch of slightly varying decision trees by selecting a random subset of both the samples and the variables for each one. The final prediction is the majority vote of the entire forest. Simple, right?
If you’re into this, definitely read Leo Breiman’s original paper.
### Results Against the Machine

After fighting your way through all this technical jargon, the results are certainly anticlimactic. A meager boost of 1-2% is all we get.
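If you want to sanity-check that number yourself, here’s a rough sketch using cross-validation (5 folds is an assumed choice; `features` and `forest` come from the snippet above):

```python
from sklearn.model_selection import cross_val_score

# Average accuracy over 5 train/validation splits.
scores = cross_val_score(forest, train[features], train['Survived'], cv=5)
print(scores.mean())
```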
## Conclusion

So what does this all mean?
- Small data sets are shit for machine learning.
- Simple can be powerful.
- If you want to survive the Titanic, don’t be poor or a man.
However, if you fall into either of those two categories, your best chance of survival is asking that paramour of yours to quit crying and make some room on the driftwood.