Titanic Survival Prediction
Kaggle competitions
By Hana Lê in EDA Binary Classification Machine learning
April 5, 2023
1. Introduction
This project uses data from the Kaggle competition “Titanic - Machine Learning from Disater”. The goal of this project is to use machine learning to build a predictive model that predicts which passengers survived the Titanic shipwreck, to answers the question: “what sorts of people were more likely to survive?” using passenger data (ie. name, age, gender, socio-economic class, etc).
It seems like a straightforward project that is quite popular among beginners in data science 😃, but my objective is to gain more experience working with this type of dataset.
2. Overview the data
2.1 Loading packages and reading the data
# Loading R packages
packages <- c("tidyverse","Amelia", "psych","DT","mice","ranger", "janitor","vcd","kableExtra", "stringr","ggplot2", "GGally", "randomForest")
sapply(packages, require, character = TRUE)
# Reading data
train <- read.csv("titanic_data/train.csv")
test <- read.csv("titanic_data/test.csv")
2.2 Data size
The Titanic train data set has 891 obs and 12 variables with the response variable Survived. The Titanic test data set has 418 obs and 11 variables.
dim(train); dim(test)
## [1] 891 12
## [1] 418 11
# Combine 2 data sets to see the structure, and for cleaning & feature engineering later.
test$Survived <- NA
data <- rbind(train, test)
dim(data)
## [1] 1309 12
The data now has 12 columns consisting of 11 predictors and the response variable Survived.
2.3 Data Structure
str(data)
## 'data.frame': 1309 obs. of 12 variables:
## $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : int 0 1 1 1 0 0 0 0 1 1 ...
## $ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
## $ Sex : chr "male" "female" "female" "female" ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : chr "" "C85" "" "C123" ...
## $ Embarked : chr "S" "C" "S" "S" ...
Some categorical variables should be in factor form:
var_cat <- c("Survived", "Pclass","Sex", "Embarked")
data[, var_cat] <- data.frame(lapply(data[, var_cat], as.factor))
2.3 Missingness of the data
It seems from the data structure that missing values are not only represented as ‘NA’, but also as empty.
sort(colSums(is.na(data)| data == ""), decreasing = T)
## Cabin Survived Age Embarked Fare PassengerId
## 1014 418 263 2 1 0
## Pclass Name Sex SibSp Parch Ticket
## 0 0 0 0 0 0
Out of a total of 1309 records, Cabin is the variable with the highest number of missing values, with 1014 records (77.5%) missing. Age is the second variable with the most missing values, with 263 records (20.1%) missing. Embarked has only 2 missing values, while Fare has only 1 missing value.
# Visualization of missing values
data[data ==""] <- NA
missmap(data[,colnames(data) != "Survived"])
2.4 Imputing missing data
# Imputing missing values for Embarked
data[is.na(data$Embarked), ]
## PassengerId Survived Pclass Name
## 62 62 1 1 Icard, Miss. Amelie
## 830 830 1 1 Stone, Mrs. George Nelson (Martha Evelyn)
## Sex Age SibSp Parch Ticket Fare Cabin Embarked
## 62 female 38 0 0 113572 80 B28 <NA>
## 830 female 62 0 0 113572 80 B28 <NA>
data %>% tabyl(Embarked, Pclass) %>%
kable() %>%
kable_styling(full_width = FALSE)
Embarked | 1 | 2 | 3 |
---|---|---|---|
0 | 0 | 0 | |
C | 141 | 28 | 101 |
Q | 3 | 7 | 113 |
S | 177 | 242 | 495 |
NA | 2 | 0 | 0 |
#names(sort(table(data$Embarked), decreasing = T))[1]
# These 2 passengers were in 1st class, in the same Cabin
# It seems likely the 2 passengers embarked from S = Southampton.
data$Embarked[is.na(data$Embarked)] <- "S"
# Imputing the missing value for Fare
data[is.na(data$Fare),]
## PassengerId Survived Pclass Name Sex Age SibSp Parch
## 1044 1044 <NA> 3 Storey, Mr. Thomas male 60.5 0 0
## Ticket Fare Cabin Embarked
## 1044 3701 NA <NA> S
## the passenger was in Pclass 3 and embarked from "S"
data$Fare[is.na(data$Fare)] <- median(data$Fare, data$Pclass == 3 & data$Embarked == "S", na.rm = T)
# I was wondering if there is any case where Cabin is exclusively for the 1st class passengers?
# Have look on Cabin
#data %>% tabyl(Pclass, Cabin)
It appears that passengers who had cabin information were predominantly in the 1st class. These cabins mostly contain letters from A to E.
According to ‘Wikipedia’ and ‘Titanic fandom’, first-class facilities and accommodation was located on the upper decks within the superstructure of the Titanic, which occupied almost the entirety of B and C Decks, but also large sections forward on A, D and E-Decks. While the third-class cabins were located on F Deck, with a few on the forward G Deck. Rather than numbered by the deck they were on, these cabins were numbered separately. This area was the first to flood during the sinking, because of their location in the lowest decks in the bow. It is possible that the location of the accommodation would have been affecting the chance of survival of passengers on the ship.
It is quite obvious to see significant association between Pclass with Cabin labeled from A to E by using Chi-Squared test in this case.
data$Cabin2 <- ifelse(is.na(data$Cabin), 0, 1)
Pclass_Cabin2 <- table(data$Pclass, data$Cabin2)
chisq.test(Pclass_Cabin2)
##
## Pearson's Chi-squared test
##
## data: Pclass_Cabin2
## X-squared = 794.43, df = 2, p-value < 2.2e-16
So I decided to drop the cabin variable from the analysis
data$Cabin <- NULL
# Imputing missing values for Age
# I'm going to use mice package for Age
# select only the columns needed for imputing Age
age_data <- data[, c('Age', 'Pclass', 'Sex', 'SibSp', 'Parch', 'Fare', 'Embarked')]
# impute missing values using MICE with the "random forest" method
set.seed(1234)
age_mice <- mice(age_data, method = "rf")
##
## iter imp variable
## 1 1 Age
## 1 2 Age
## 1 3 Age
## 1 4 Age
## 1 5 Age
## 2 1 Age
## 2 2 Age
## 2 3 Age
## 2 4 Age
## 2 5 Age
## 3 1 Age
## 3 2 Age
## 3 3 Age
## 3 4 Age
## 3 5 Age
## 4 1 Age
## 4 2 Age
## 4 3 Age
## 4 4 Age
## 4 5 Age
## 5 1 Age
## 5 2 Age
## 5 3 Age
## 5 4 Age
## 5 5 Age
age_imputed <- complete(age_mice)
data$Age <- age_imputed$Age
# Checking missing values
colSums(is.na(data[,colnames(data) != "Survived"]))
## PassengerId Pclass Name Sex Age SibSp
## 0 0 0 0 0 0
## Parch Ticket Fare Embarked Cabin2
## 0 0 0 0 0
2.5 Descriptive statistics
data_table <- describe(data)
data_table %>% round(digits = 3) %>%
DT::datatable(options = list(pageLength = 10))
<div class="datatables html-widget html-fill-item-overflow-hidden html-fill-item" id="htmlwidget-e022137f27b898f37092" style="width:100%;height:auto;"></div>
<script type="application/json" data-for="htmlwidget-e022137f27b898f37092">{"x":{"filter":"none","vertical":false,"data":[["PassengerId","Survived*","Pclass*","Name*","Sex*","Age","SibSp","Parch","Ticket*","Fare","Embarked*","Cabin2"],[1,2,3,4,5,6,7,8,9,10,11,12],[1309,891,1309,1309,1309,1309,1309,1309,1309,1309,1309,1309],[655,1.384,2.295,653.694,1.644,29.503,0.499,0.385,464.604,33.281,3.494,0.225],[378.02,0.487,0.838,377.31,0.479,14.17,1.042,0.866,278.039,51.741,0.814,0.418],[655,1,3,653,2,28,0,0,460,14.454,4,0],[655,1.355,2.368,653.619,1.68,28.943,0.275,0.175,465.234,21.568,3.616,0.157],[484.81,0,0,484.81,0,11.861,0,0,379.546,10.236,0,0],[1,1,1,1,1,0.17,0,0,1,0,2,0],[1309,2,3,1307,2,80,8,9,929,512.329,4,1],[1308,1,2,1306,1,79.83,8,9,928,512.329,2,1],[0,0.477,-0.597,0.002,-0.601,0.472,3.835,3.661,-0.009,4.36,-1.125,1.313],[-1.203,-1.775,-1.317,-1.202,-1.64,0.312,19.927,21.417,-1.328,26.896,-0.548,-0.276],[10.448,0.016,0.023,10.429,0.013,0.392,0.029,0.024,7.685,1.43,0.023,0.012]],"container":"<table class=\"display\">\n <thead>\n <tr>\n <th> <\/th>\n <th>vars<\/th>\n <th>n<\/th>\n <th>mean<\/th>\n <th>sd<\/th>\n <th>median<\/th>\n <th>trimmed<\/th>\n <th>mad<\/th>\n <th>min<\/th>\n <th>max<\/th>\n <th>range<\/th>\n <th>skew<\/th>\n <th>kurtosis<\/th>\n <th>se<\/th>\n <\/tr>\n <\/thead>\n<\/table>","options":{"pageLength":10,"columnDefs":[{"className":"dt-right","targets":[1,2,3,4,5,6,7,8,9,10,11,12,13]},{"orderable":false,"targets":0}],"order":[],"autoWidth":false,"orderClasses":false}},"evals":[],"jsHooks":[]}</script>
3. Exploring variables
3.1 Survived
data %>%
filter(!is.na(Survived)) %>%
tabyl(Survived) %>%
adorn_pct_formatting() %>%
kable() %>%
kable_styling(full_width = FALSE)
Survived | n | percent |
---|---|---|
0 | 549 | 61.6% |
1 | 342 | 38.4% |
3.2 Exploring potential predcitors of Survived
Passenger titles provide some information about their social status, gender, and possibly age which could be strong predictors for survival. These titles were included in the name of each passenger so we can extract them out.
data$Title <- sapply(data$Name, function(x) strsplit(x, split = '[,.]')[[1]][[2]])
data$Title <- sub(' ', '', data$Title)
data %>% tabyl(Title, Sex) %>%
kable() %>%
kable_styling(full_width = FALSE)
Title | female | male |
---|---|---|
Capt | 0 | 1 |
Col | 0 | 4 |
Don | 0 | 1 |
Dona | 1 | 0 |
Dr | 1 | 7 |
Jonkheer | 0 | 1 |
Lady | 1 | 0 |
Major | 0 | 2 |
Master | 0 | 61 |
Miss | 260 | 0 |
Mlle | 2 | 0 |
Mme | 1 | 0 |
Mr | 0 | 757 |
Mrs | 197 | 0 |
Ms | 2 | 0 |
Rev | 0 | 8 |
Sir | 0 | 1 |
the Countess | 1 | 0 |
data <- data %>%
mutate(Title = case_when(
Title %in% c("Mlle", "Ms") ~ "Miss",
Title == "Mme" ~ "Mrs",
Title %in% c("the Countess", "Dona", "Lady", "Jonkheer") ~ "Noble",
Title %in% c("Capt","Col", "Dr", "Rev", "Don", "Major", "Sir") ~ "Officer",
TRUE ~ Title
))
data %>% tabyl(Title, Sex) %>%
kable() %>%
kable_styling(full_width = FALSE)
Title | female | male |
---|---|---|
Master | 0 | 61 |
Miss | 264 | 0 |
Mr | 0 | 757 |
Mrs | 198 | 0 |
Noble | 3 | 1 |
Officer | 1 | 24 |
3.2.1 Potential continuous predictors of Survived
sub_con <- subset(data[!is.na(data$Survived),], select = c("Survived", "Age", "Fare"))
# Create scatter plot matrix with ggally
ggpairs(sub_con, aes(color = Survived, alpha = 0.7),
lower = list(combo = "count"))
The median age of the survivors appears to be slightly higher than that of non-survivors, indicating that older individuals had a higher chance of survival. On the other hand, the median fare of the surviving passengers is noticeably higher, suggesting a potential correlation between fare paid and survival rate.
3.2.2 Age categories
I read somewhere that priority to secure a place on lifeboats was given to “Women and children first”. So younger passenger should have better surviving odds.
I’m going to create age_cate to check is there was an association between age_cat and Survived.
data$age_cat <- cut(data$Age, breaks = c(0,18,40,60,95),
label = c("under18","middle", "senior", "elder"))
data %>%
filter(!is.na(Survived)) %>%
tabyl(age_cat, Survived) %>%
adorn_percentages("row") %>%
adorn_pct_formatting(rounding = "half up", digits = 0) %>%
adorn_ns(position = "front") %>%
knitr::kable()
age_cat | 0 | 1 |
---|---|---|
under18 | 85 (51%) | 81 (49%) |
middle | 354 (65%) | 193 (35%) |
senior | 90 (60%) | 61 (40%) |
elder | 20 (74%) | 7 (26%) |
#data %>%
#filter(!is.na(Survived)) %>%
#chisq.test(x = .$Survived, y =.$age_cat, correct = FALSE)
mosaic(~ age_cat + Survived, data = data[!is.na(data$Survived),], gp = shading_max, rot_labels = 0)
The association is significant (p-value < 0.05). Passengers who were under 18 had higher survived rate than other groups, whereas the elder had the lowest likelihood of surviving.
3.2.3 Sex
data %>%
filter(!is.na(Survived)) %>%
tabyl(Sex, Survived) %>%
adorn_percentages("row") %>%
adorn_pct_formatting(rounding = "half up", digits = 0) %>%
adorn_ns(position = "front") %>%
knitr::kable()
Sex | 0 | 1 |
---|---|---|
female | 81 (26%) | 233 (74%) |
male | 468 (81%) | 109 (19%) |
#chisq.test(data$Survived, data$Sex, correct = FALSE)
mosaic(~ Sex + Survived, data = data[!is.na(data$Survived),], gp = shading_max, rot_labels=0)
Female passengers had a noticeably higher rate of survival when compared to male passengers( p-value < 0.001).
3.2.4 Title
data %>%
filter(!is.na(Survived)) %>%
tabyl(Title, Survived) %>%
adorn_percentages("row") %>%
adorn_pct_formatting(rounding = "half up", digits = 0) %>%
adorn_ns(position = "front") %>%
knitr::kable()
Title | 0 | 1 |
---|---|---|
Master | 17 (43%) | 23 (58%) |
Miss | 55 (30%) | 130 (70%) |
Mr | 436 (84%) | 81 (16%) |
Mrs | 26 (21%) | 100 (79%) |
Noble | 1 (33%) | 2 (67%) |
Officer | 14 (70%) | 6 (30%) |
mosaic(~ Title + Survived, data = data[!is.na(data$Survived),], gp = shading_max, rot_labels=0)
The title that a passenger held had an impact on their odds of survival (p-value < 0.001). In line with the previous observation, it appears that women with titles of Miss or Mrs had a higher likelihood of surviving. Additionally, it is noteworthy that passengers with titles of Master or Noble had a better chance of survival when compared to those with titles of Mr or Officer.
3.2.4 Pclass
…..
4. Data preparation for modelling
5. Modelling
6. Conclusion
- Posted on:
- April 5, 2023
- Length:
- 9 minute read, 1907 words
- Categories:
- EDA Binary Classification Machine learning
- Tags:
- R Kaggle Competition