A reanalysis of the Titanic data set in R and Quarto
Statistical report
Objectives
We will learn Quarto by a practical example: a reanalysis of the Titanic data. Quarto enables you to put together content, executable code, tables, and figures into a finished document. To learn more about Quarto, see https://quarto.org.
In the analysis below, I stick to the following structure. This structure can also be seen in the table of contents on the right.
Here is a somewhat more detailed Data Analysis Workflow that I’m using in my own projects.
A final word of caution.
“A $100 analysis can make a $1,000,000 study worthless” –Frank Harrell
Primary goal
To identify factors associated with passenger survival in the 1912 Titanic incidence.
Data import
Usually this step involves reading an Excel file, data cleaning, merging different data sets, reshaping data, etc. Here, it is very simple. We load the data from the package carData
. We also do quality control with the package testit
.
If you don’t have the required packages already installed, you first need to install them.
install.packages("carData")
install.packages("testit")
Code
library(carData)
library(testit)
After loading the package you can find the data in the object TitanicSurvival
; I will rename it to data
for simplicity. Inspect the data. What variables are available, and what type of data is it (continuous, binary, categorical, etc.)?
Code
= TitanicSurvival data
Now, I show a random set of passengers (rows in the data).
Code
set.seed(2024)
= sample(nrow(data), 6)
idx data[idx,]
survived | sex | age | passengerClass | |
---|---|---|---|---|
Richards, Master. George Sibley | yes | male | 0.8333 | 2nd |
Cacic, Mr. Luka | no | male | 38.0000 | 3rd |
Vendel, Mr. Olof Edvin | no | male | 20.0000 | 3rd |
Karaic, Mr. Milan | no | male | 30.0000 | 3rd |
Andersen, Mr. Albert Karvin | no | male | 32.0000 | 3rd |
Eustis, Miss. Elizabeth Mussey | yes | female | 54.0000 | 1st |
Data processing
This data set is very tidy and clean. We don’t need to do a lot of processing.
Histogram of age
However, it is important to get an overview of the data. I need to understand how the data looks like, for example, by plotting histograms for all continuous data variables.
Code
hist(data$age, main = "Distribution of age", xlab = "Age (years)",
col = "#001E7C", border = "grey90")
I don’t like the label “Frequency”; change it to “Counts”.
Definition of outcome
Next, I define the primary outcome with a new variable called event
. It is a binary outcome set to TRUE
when the passenger survived, and FALSE
otherwise.
Code
$event = data$survived == "yes" data
Missing data
A very important step is to assess and address missing data. I found that age
has missing data, so I report the number of missing values in age (%).
Code
= sum(is.na(data$age))
count = sum(is.na(data$age))/nrow(data)*100
prc
= data.frame(age = sprintf("%d (%.1f%%)", count, prc))
tab tab
age |
---|
263 (20.1%) |
Create analysis dataset
We will do a complete case analysis. Actually, this is not really recommended; however, in this tutorial, it is fit for purpose. The analysis data set is stored in the object titanic
.
How can missing data be addressed otherwise?
Code
= complete.cases(data)
idx = data[idx,] titanic
Quality control
Quality control is a crucial step and should be part of any data analysis pipeline. A test is like a question to the data that should be true, for example:
- Do all passengers have a value in
passengerClass
(no missing)? - Are all the values in
age
larger than 0 and smaller than 100?
As soon as the logical expression within the function assert
is not met, an error will be thrown.
Code
assert(!is.na(titanic$passengerClass))
You can now extend the code block above. You could test the other variable sex
for completeness. You can try to test that the age range is between 0 and 100, for example. You can test anything.
Test the obvious.
Descriptive statistics
I report some descriptives of the Titanic passengers, but I’m lazy and just report the sample size and age.
Code
= quantile(titanic$age, probs = c(0.5, 0.25, 0.75))
q_age
= data.frame(
tab
Descriptives = c(
sprintf("N=%d", nrow(titanic)),
sprintf("%0.f (%0.f--%0.f)", q_age[1], q_age[2], q_age[3])
)
)
rownames(tab) = c("Sample Size",
"Age, median (IQR)")
tab
Descriptives | |
---|---|
Sample Size | N=1046 |
Age, median (IQR) | 28 (21–39) |
You can now try to extend Table 1 and also report the count (and %) for categorical variables sex
and passengerClass
.
Primary analysis
We fit a logistic regression model to predict the binary outcome “survived: yes”. I also set the reference levels to my liking.
Code
$sex = factor(titanic$sex, levels = c("male", "female"))
titanic$passengerClass = factor(titanic$passengerClass, levels = c("3rd", "2nd", "1st"))
titanic
= glm(event ~ sex + age + passengerClass, data = titanic, family = binomial)
fit
= summary(fit)$coefficients
tab as.data.frame.matrix(tab)
Estimate | Std. Error | z value | Pr(>|z|) | |
---|---|---|---|---|
(Intercept) | -1.2654312 | 0.2028675 | -6.237722 | 0e+00 |
sexfemale | 2.4978447 | 0.1660365 | 15.043949 | 0e+00 |
age | -0.0343932 | 0.0063310 | -5.432511 | 1e-07 |
passengerClass2nd | 1.0090908 | 0.1983625 | 5.087104 | 4e-07 |
passengerClass1st | 2.2896606 | 0.2258019 | 10.140129 | 0e+00 |
However, this table is not looking nice and is also hard to interpret. I can explain why. Therefore, I improve this table, see Table 2. That is how I like to report a logistic model.
You can execute the lines in the code block below one by one and try to understand what is going on.
Code
= as.data.frame.matrix(tab)
tab $OR = as.character(signif(exp(tab$Estimate), digits = 3))
tab$z.value.fmt = as.character(signif(tab$`z value`, digits = 3))
tab$pvalue = ggsurvfit::format_p(tab$`Pr(>|z|)`)
tab
# add 95% confidence interval
= confint(fit, level = 0.95)
ci = as.character(signif(exp(ci[,1]), digits = 2))
lower = as.character(signif(exp(ci[,2]), digits = 2))
upper
$CI = sprintf("from %s to %s", lower, upper)
tab
= tab[, c("OR", "CI", "z.value.fmt", "pvalue")]
tab.tidy colnames(tab.tidy) = c("Odds ratio", "95% CI", "z value", "p value")
tab.tidy
Odds ratio | 95% CI | z value | p value | |
---|---|---|---|---|
(Intercept) | 0.282 | from 0.19 to 0.42 | -6.24 | <0.001 |
sexfemale | 12.2 | from 8.8 to 17 | 15 | <0.001 |
age | 0.966 | from 0.95 to 0.98 | -5.43 | <0.001 |
passengerClass2nd | 2.74 | from 1.9 to 4.1 | 5.09 | <0.001 |
passengerClass1st | 9.87 | from 6.4 to 15 | 10.1 | <0.001 |
The row names can be improved. Try it.
Now we are done. We found that on the Titanic, it was not very ideal to be a 3rd class, a male, or an older passenger. And it was certainly not ideal to be all of this (it is an additive model). I will discuss the table in more detail if you wish.
Any suggestion to improve this analysis? Also, if you have seen any error, typo, or any other mishap in this document, please let me know. I try to improve it every year.
Computing information
Code
sessionInfo()
R version 4.4.1 (2024-06-14 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 10 x64 (build 19045)
Matrix products: default
locale:
[1] LC_COLLATE=English_Switzerland.utf8 LC_CTYPE=English_Switzerland.utf8
[3] LC_MONETARY=English_Switzerland.utf8 LC_NUMERIC=C
[5] LC_TIME=English_Switzerland.utf8
time zone: Europe/Zurich
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] testit_0.13 carData_3.0-5
loaded via a namespace (and not attached):
[1] vctrs_0.6.5 cli_3.6.3 knitr_1.48 rlang_1.1.4
[5] xfun_0.48 generics_0.1.3 jsonlite_1.8.9 glue_1.8.0
[9] colorspace_2.1-1 htmltools_0.5.8.1 scales_1.3.0 fansi_1.0.6
[13] rmarkdown_2.28 grid_4.4.1 munsell_0.5.1 evaluate_1.0.1
[17] tibble_3.2.1 fastmap_1.2.0 yaml_2.3.10 lifecycle_1.0.4
[21] compiler_4.4.1 ggsurvfit_1.1.0 dplyr_1.1.4 htmlwidgets_1.6.4
[25] pkgconfig_2.0.3 lattice_0.22-6 digest_0.6.37 R6_2.5.1
[29] tidyselect_1.2.1 utf8_1.2.4 splines_4.4.1 pillar_1.9.0
[33] magrittr_2.0.3 Matrix_1.7-0 gtable_0.3.5 tools_4.4.1
[37] survival_3.7-0 ggplot2_3.5.1