Jon Karrer

Titanic Tabular Binary Classification

The sinking of the Titanic is one of the most infamous shipwrecks in history.

On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

In this challenge, we ask you to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (i.e., name, age, gender, socio-economic class, etc.).

Goal

It is your job to predict if a passenger survived the sinking of the Titanic or not. For each passenger in the test set, you must predict a 0 or 1 value for the Survived variable.

Metric

Your score is the percentage of passengers you correctly predict. This is known as accuracy.

Submission File Format

You should submit a csv file with exactly 418 entries plus a header row. Your submission will show an error if you have extra columns (beyond PassengerId and Survived) or rows. The file should have exactly 2 columns:

- PassengerId (sorted in any order)
- Survived (contains your binary predictions: 1 for survived, 0 for deceased)
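For example, the first few rows of a valid submission could look like this (the predictions here are illustrative; the test set's PassengerId values run from 892 to 1309):

```
PassengerId,Survived
892,0
893,1
894,0
```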

Data

The dataset used in this challenge can be found on Kaggle.

| Variable | Definition                                 | Key                                            |
| -------- | ------------------------------------------ | ---------------------------------------------- |
| survival | Survival                                   | 0 = No, 1 = Yes                                |
| pclass   | Ticket class                               | 1 = 1st, 2 = 2nd, 3 = 3rd                      |
| sex      | Sex                                        |                                                |
| age      | Age in years                               |                                                |
| sibsp    | # of siblings / spouses aboard the Titanic |                                                |
| parch    | # of parents / children aboard the Titanic |                                                |
| ticket   | Ticket number                              |                                                |
| fare     | Passenger fare                             |                                                |
| cabin    | Cabin number                               |                                                |
| embarked | Port of Embarkation                        | C = Cherbourg, Q = Queenstown, S = Southampton |

Cleaning

Often, raw data is a bit messy. Let's clean it up.

Missing data

We see that age is missing 177 values, cabin is missing 687, and embarked is missing 2.

The mode for age is 24, the mode for cabin is "C23 C25 C27", and the mode for embarked is "S".

Simple enough: we can fill each missing value with its column's mode, as sketched below.
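A minimal sketch of mode imputation for the age column, assuming the raw values were parsed into `Option<f32>` (with `None` marking a missing entry):

```rust
use std::collections::HashMap;

/// Fill missing ages with the most frequent (mode) value, 24.0 for this dataset.
fn fill_with_mode(ages: &mut [Option<f32>]) {
    // f32 is not hashable, so count occurrences by bit pattern.
    let mut counts: HashMap<u32, usize> = HashMap::new();
    for age in ages.iter().flatten() {
        *counts.entry(age.to_bits()).or_insert(0) += 1;
    }
    // The mode is the most frequent value present.
    let mode = counts
        .into_iter()
        .max_by_key(|&(_, count)| count)
        .map(|(bits, _)| f32::from_bits(bits))
        .unwrap_or(0.0);
    // Replace every None with the mode.
    for age in ages.iter_mut() {
        age.get_or_insert(mode);
    }
}
```

The same pattern works for the cabin and embarked columns with `String` keys instead of bit patterns.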

Outliers

There are 3 fare outliers; we will simply remove them.
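A sketch of the removal, assuming a `Row` struct for the cleaned records; the cutoff of 500 is an assumption (the three extreme fares in this dataset all sit above it), not a rule stated here:

```rust
/// A cleaned record; other fields elided for brevity.
struct Row {
    fare: f32,
}

/// Drop the extreme fare outliers. The 500.0 cutoff is an assumed threshold.
fn remove_fare_outliers(rows: &mut Vec<Row>) {
    rows.retain(|row| row.fare <= 500.0);
}
```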

Visualize

Visualizing the data can surface insights and expose weaknesses in the data. The weaknesses need to be amended, and the insights need to be leveraged.

We will use Plotly to create plots.

Here are some examples:
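For instance, a histogram of the cleaned ages, sketched with the `plotly` crate (the exact plots shown here are assumptions):

```rust
use plotly::{Histogram, Plot};

fn main() {
    // Hypothetical sample of cleaned ages; the real values come from the CSV.
    let ages = vec![22.0, 38.0, 26.0, 35.0, 24.0, 24.0, 54.0];

    let trace = Histogram::new(ages).name("Age");
    let mut plot = Plot::new();
    plot.add_trace(trace);
    plot.show(); // renders the chart in the browser
}
```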

Correlations

Correlations help uncover relationships between variables. A strong correlation is indicative of a near-linear relationship.
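Pearson's r is the usual measure: the covariance of the two series divided by the product of their standard deviations, ranging from -1 to 1. A small helper, as a sketch:

```rust
/// Pearson correlation coefficient: cov(x, y) / (std(x) * std(y)).
fn pearson(x: &[f32], y: &[f32]) -> f32 {
    assert_eq!(x.len(), y.len());
    let n = x.len() as f32;
    let mean_x = x.iter().sum::<f32>() / n;
    let mean_y = y.iter().sum::<f32>() / n;
    let cov: f32 = x.iter().zip(y).map(|(a, b)| (a - mean_x) * (b - mean_y)).sum::<f32>() / n;
    let var_x: f32 = x.iter().map(|a| (a - mean_x).powi(2)).sum::<f32>() / n;
    let var_y: f32 = y.iter().map(|b| (b - mean_y).powi(2)).sum::<f32>() / n;
    cov / (var_x.sqrt() * var_y.sqrt())
}
```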

Insights

Classification

Now we need to create our features and classifications.

The fields that we are interested in are:

These will be our features. The survival field will be our classification.

Data Points

A single data point will be a 1D tensor of length 6 for the features, and a 1D tensor of length 1 for the classification. We group these into a struct called DataPoint. Simply looping through the dataset and creating the data points will do the trick.
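A minimal sketch of the struct and the conversion loop (the field names are my assumption):

```rust
/// One cleaned record: six numeric features plus the 0/1 survival label.
struct DataPoint {
    features: [f32; 6],
    label: f32,
}

/// Build the data points from already-cleaned numeric rows.
fn to_data_points(rows: &[([f32; 6], f32)]) -> Vec<DataPoint> {
    rows.iter()
        .map(|&(features, label)| DataPoint { features, label })
        .collect()
}
```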

Batch

We need to create 2 tensors for our model to interpret the data. Our feature tensor will be a 2D tensor, with shape [798, 6]. That's 798 rows and 6 columns. Our target tensor will be a 2D tensor, with shape [798, 1]. This will be considered a batch that the model can interpret. Since we are dealing with little data, we can just create a single batch.
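A sketch of the flattening, reusing the `DataPoint` struct from above: it produces row-major buffers that a tensor library can reshape into `[798, 6]` and `[798, 1]`.

```rust
/// Flatten the data points into row-major buffers for the two batch tensors.
fn to_batch(points: &[DataPoint]) -> (Vec<f32>, Vec<f32>) {
    let mut features = Vec::with_capacity(points.len() * 6);
    let mut labels = Vec::with_capacity(points.len());
    for point in points {
        features.extend_from_slice(&point.features); // one row of 6 per point
        labels.push(point.label);
    }
    (features, labels) // reshape to [798, 6] and [798, 1] downstream
}
```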

Model

Now we need to create the model to send our batches to: a simple regression-style neural network that maps the six input features through hidden layers to a single output.
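As a sketch, here is the forward pass of such a network with one hidden ReLU layer of 16 units; the layer sizes are assumptions, not the exact configuration used.

```rust
/// A tiny feed-forward net: 6 inputs -> 16 hidden (ReLU) -> 1 output.
struct Model {
    w1: [[f32; 6]; 16], // hidden-layer weights
    b1: [f32; 16],      // hidden-layer biases
    w2: [f32; 16],      // output-layer weights
    b2: f32,            // output-layer bias
}

impl Model {
    fn forward(&self, x: &[f32; 6]) -> f32 {
        let mut out = self.b2;
        for i in 0..16 {
            // Hidden unit i: weighted sum of the inputs, then ReLU.
            let z: f32 = self.w1[i].iter().zip(x).map(|(w, v)| w * v).sum::<f32>() + self.b1[i];
            out += self.w2[i] * z.max(0.0);
        }
        out
    }
}
```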

Loss Function

We will simply forward the batches through the model's layers, and calculate the loss. For this task, we will use the Mean Squared Error (MSE) loss function.
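MSE averages the squared difference between predictions and targets: MSE = (1/n) Σ (ŷᵢ − yᵢ)². As a function:

```rust
/// Mean squared error over a batch of predictions.
fn mse_loss(predictions: &[f32], targets: &[f32]) -> f32 {
    assert_eq!(predictions.len(), targets.len());
    let n = predictions.len() as f32;
    predictions
        .iter()
        .zip(targets)
        .map(|(p, t)| (p - t).powi(2))
        .sum::<f32>()
        / n
}
```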

Optimizer

Now we need to choose the optimizer and learning rate. For this we will use Stochastic Gradient Descent (SGD) with a learning rate of 1e-4 to start.
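SGD nudges each parameter against its gradient, scaled by the learning rate: w ← w − η · ∂L/∂w. A one-step sketch:

```rust
/// One SGD step: move each parameter against its gradient,
/// scaled by the learning rate (1e-4 here).
fn sgd_step(params: &mut [f32], grads: &[f32], lr: f32) {
    for (p, g) in params.iter_mut().zip(grads) {
        *p -= lr * g;
    }
}
```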

Training

Now we need to bring it all together and train our model. We loop through epochs; on each epoch we pass the batch through the model, calculate the loss, and backpropagate the gradients. Repeat until trained.
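To make the loop shape concrete, here is a self-contained miniature on toy data, with a single linear neuron and hand-derived MSE gradients; the real model and data are larger, but the skeleton is the same.

```rust
fn main() {
    // Toy batch: two features per row, one target per row.
    let features = vec![[0.0_f32, 1.0], [1.0, 0.0], [1.0, 1.0]];
    let targets = vec![0.0_f32, 1.0, 1.0];
    let (mut w, mut b) = ([0.0_f32; 2], 0.0_f32);
    let lr = 0.1;

    for epoch in 0..1000 {
        // Forward pass: predictions for the whole batch.
        let preds: Vec<f32> = features
            .iter()
            .map(|x| w[0] * x[0] + w[1] * x[1] + b)
            .collect();

        // MSE loss and its gradients with respect to w and b.
        let n = preds.len() as f32;
        let loss: f32 = preds
            .iter()
            .zip(&targets)
            .map(|(p, t)| (p - t).powi(2))
            .sum::<f32>()
            / n;
        let (mut grad_w, mut grad_b) = ([0.0_f32; 2], 0.0_f32);
        for (x, (p, t)) in features.iter().zip(preds.iter().zip(&targets)) {
            let d = 2.0 * (p - t) / n; // dL/d(prediction)
            grad_w[0] += d * x[0];
            grad_w[1] += d * x[1];
            grad_b += d;
        }

        // Backward pass: SGD update.
        w[0] -= lr * grad_w[0];
        w[1] -= lr * grad_w[1];
        b -= lr * grad_b;

        if epoch % 200 == 0 {
            println!("epoch {epoch}: loss {loss:.4}");
        }
    }
}
```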

Results

Run One

Configuration:

Score:

Run Two

Configuration:

Score:

Run Three

Configuration:

Score:

Run Four

Configuration:

Score:

Submission

The model is now ready to be submitted. We run the Kaggle test CSV through our model and submit our predictions. We got an official score of 0.64832. Not wonderful, but it was a great challenge.

Further Improvements

We used the most basic and lightly correlated information that was easy to digest in Rust. The important part of this project was to learn how to build a neural network in Rust, from scratch, with a custom training loop. But if we were to go for Gold, here's what we could do: