A reanalysis of the Titanic data set in R and Quarto

Statistical report

Author

Simon Schwab

Published

November 1, 2024

Abstract
An introduction to Quarto for the participants in the Good Research Practice (GRP) course at the University of Zurich.

Objectives

We will learn Quarto by a practical example: a reanalysis of the Titanic data. Quarto enables you to put together content, executable code, tables, and figures into a finished document. To learn more about Quarto, see https://quarto.org.

In the analysis below, I stick to the following structure. This structure can also be seen in the table of contents on the right.

[Flowchart: Objectives → Data import → Data processing → Quality control → Descriptive statistics → Primary analysis → Computing information]

Here is a somewhat more detailed Data Analysis Workflow that I use in my own projects.

A final word of caution.

“A $100 analysis can make a $1,000,000 study worthless” –Frank Harrell

Primary goal

To identify factors associated with passenger survival in the 1912 Titanic disaster.

Data import

Usually this step involves reading an Excel file, data cleaning, merging different data sets, reshaping data, etc. Here, it is very simple. We load the data from the package carData. We also do quality control with the package testit.

If you don’t have the required packages already installed, you first need to install them.

install.packages("carData")
install.packages("testit")
Code
library(carData)
library(testit)

After loading the package, you can find the data in the object TitanicSurvival; I will rename it to data for simplicity. Inspect the data. What variables are available, and what type of data are they (continuous, binary, categorical, etc.)?

Code
data = TitanicSurvival
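
To inspect the data, here is a minimal sketch; str() and summary() are one possible starting point.

Code
# Sketch: inspect variable names, types, and summaries.
str(data)      # age is numeric; survived, sex, and passengerClass are factors
summary(data)  # per-variable summaries, including the number of NAs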

Now, I show a random set of passengers (rows in the data).

Code
set.seed(2024)
idx = sample(nrow(data), 6)
data[idx,]
survived sex age passengerClass
Richards, Master. George Sibley yes male 0.8333 2nd
Cacic, Mr. Luka no male 38.0000 3rd
Vendel, Mr. Olof Edvin no male 20.0000 3rd
Karaic, Mr. Milan no male 30.0000 3rd
Andersen, Mr. Albert Karvin no male 32.0000 3rd
Eustis, Miss. Elizabeth Mussey yes female 54.0000 1st

Data processing

This data set is very tidy and clean. We don’t need to do a lot of processing.

Histogram of age

However, it is important to get an overview of the data. I need to understand what the data look like, for example, by plotting histograms for all continuous variables.

Code
hist(data$age, main = "Distribution of age", xlab = "Age (years)",
     col = "#001E7C", border = "grey90")

I don’t like the y-axis label “Frequency”; change it to “Counts”.
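
One possible way to do this, as a sketch: pass ylab to hist().

Code
# Sketch: the same histogram with the y-axis label changed to "Counts".
hist(data$age, main = "Distribution of age", xlab = "Age (years)",
     ylab = "Counts", col = "#001E7C", border = "grey90")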

Definition of outcome

Next, I define the primary outcome with a new variable called event. It is a binary outcome set to TRUE when the passenger survived, and FALSE otherwise.

Code
data$event = data$survived == "yes"

Missing data

A very important step is to assess and address missing data. I found that age has missing data, so I report the number (and percentage) of missing values in age.

Code
count = sum(is.na(data$age))
prc = sum(is.na(data$age))/nrow(data)*100

tab = data.frame(age = sprintf("%d (%.1f%%)", count, prc))
tab
age
263 (20.1%)

Create analysis dataset

We will do a complete case analysis. This is generally not recommended; however, for this tutorial it is fit for purpose. The analysis data set is stored in the object titanic.

How else could missing data be addressed? One alternative is sketched after the code block below.

Code
idx = complete.cases(data)
titanic = data[idx,]
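
As an aside, one alternative to a complete case analysis is multiple imputation. Below is a minimal sketch using the mice package (assuming it is installed); it is not used in the rest of this analysis.

Code
# Sketch only (not used below): multiple imputation with mice.
library(mice)
imp = mice(data[, c("survived", "sex", "age", "passengerClass")],
           m = 5, seed = 2024, printFlag = FALSE)
fit.imp = with(imp, glm(survived == "yes" ~ sex + age + passengerClass,
                        family = binomial))
summary(pool(fit.imp))  # pooled estimates across the imputed data sets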

Quality control

Quality control is a crucial step and should be part of any data analysis pipeline. A test is an assertion about the data that should hold true, for example:

  • Do all passengers have a value in passengerClass (no missing)?
  • Are all the values in age larger than 0 and smaller than 100?

If the logical expression within the function assert is not met, an error is thrown.

Code
assert(!is.na(titanic$passengerClass))

You can now extend the code block above: test the other variable sex for completeness, check that age lies between 0 and 100, and so on. You can test anything (see the sketch below).

Test the obvious.
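
For example, the checks suggested above could look like this (a sketch using the same assert() function):

Code
# Sketch: additional checks suggested above.
assert(!is.na(titanic$sex))                  # sex is complete
assert(titanic$age > 0 & titanic$age < 100)  # plausible age range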

Descriptive statistics

I report some descriptives of the Titanic passengers, but I’m lazy and just report the sample size and age.

Code
q_age = quantile(titanic$age, probs = c(0.5, 0.25, 0.75))

tab = data.frame(
  
  Descriptives = c(
    sprintf("N=%d", nrow(titanic)),
    sprintf("%0.f (%0.f--%0.f)", q_age[1], q_age[2], q_age[3])
  )
)

rownames(tab) = c("Sample Size",
                  "Age, median (IQR)")
tab
Table 1: Descriptives of the Titanic passengers.
Descriptives
Sample Size N=1046
Age, median (IQR) 28 (21–39)

You can now try to extend Table 1 and also report the counts (and %) for the categorical variables sex and passengerClass.
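
A minimal sketch of how such counts (and %) could be computed, here for passengerClass:

Code
# Sketch: counts (and %) for a categorical variable, here passengerClass.
counts = table(titanic$passengerClass)
sprintf("%s: %d (%.1f%%)", names(counts), as.integer(counts),
        as.integer(counts)/nrow(titanic)*100)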

Primary analysis

We fit a logistic regression model to predict the binary outcome “survived: yes”. I also set the reference levels to my liking.

Code
titanic$sex = factor(titanic$sex, levels = c("male", "female"))
titanic$passengerClass = factor(titanic$passengerClass, levels = c("3rd", "2nd", "1st"))

fit = glm(event ~ sex + age + passengerClass, data = titanic,  family = binomial)

tab = summary(fit)$coefficients
as.data.frame.matrix(tab)
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.2654312 0.2028675 -6.237722 0e+00
sexfemale 2.4978447 0.1660365 15.043949 0e+00
age -0.0343932 0.0063310 -5.432511 1e-07
passengerClass2nd 1.0090908 0.1983625 5.087104 4e-07
passengerClass1st 2.2896606 0.2258019 10.140129 0e+00

However, this table does not look nice and is hard to interpret (I can explain why). Therefore, I improve it; see Table 2. That is how I like to report a logistic regression model.

You can execute the lines in the code block below one by one and try to understand what is going on.

Code
tab             = as.data.frame.matrix(tab)
tab$OR          = as.character(signif(exp(tab$Estimate), digits = 3))
tab$z.value.fmt = as.character(signif(tab$`z value`, digits = 3))
tab$pvalue      = ggsurvfit::format_p(tab$`Pr(>|z|)`)

# add 95% confidence interval
ci = confint(fit, level = 0.95)
lower = as.character(signif(exp(ci[,1]), digits = 2))
upper = as.character(signif(exp(ci[,2]), digits = 2))

tab$CI = sprintf("from %s to %s", lower, upper)

tab.tidy = tab[, c("OR", "CI", "z.value.fmt", "pvalue")]
colnames(tab.tidy) = c("Odds ratio", "95% CI", "z value", "p value")
tab.tidy
Table 2: Results from a logistic regression model.
Odds ratio 95% CI z value p value
(Intercept) 0.282 from 0.19 to 0.42 -6.24 <0.001
sexfemale 12.2 from 8.8 to 17 15 <0.001
age 0.966 from 0.95 to 0.98 -5.43 <0.001
passengerClass2nd 2.74 from 1.9 to 4.1 5.09 <0.001
passengerClass1st 9.87 from 6.4 to 15 10.1 <0.001

The row names can be improved. Try it.
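
One possible sketch (the labels are my own wording):

Code
# Sketch: more readable row labels for Table 2.
rownames(tab.tidy) = c("Intercept", "Sex: female (vs. male)", "Age (per year)",
                       "Class: 2nd (vs. 3rd)", "Class: 1st (vs. 3rd)")
tab.tidy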

Now we are done. We found that on the Titanic, the odds of survival were lower for third-class passengers, for males, and for older passengers, and lower still for passengers who were all of these (it is an additive model). I will discuss the table in more detail if you wish.
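
To illustrate what the additive model implies, here is a minimal sketch comparing predicted survival probabilities for two hypothetical passengers (the profiles are made up for illustration):

Code
# Sketch: predicted survival probabilities for two hypothetical passengers,
# a 30-year-old woman in 1st class and a 60-year-old man in 3rd class.
newdat = data.frame(
  sex            = factor(c("female", "male"), levels = c("male", "female")),
  age            = c(30, 60),
  passengerClass = factor(c("1st", "3rd"), levels = c("3rd", "2nd", "1st"))
)
predict(fit, newdata = newdat, type = "response")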

Any suggestions to improve this analysis? Also, if you spot any error, typo, or other mishap in this document, please let me know. I try to improve it every year.

Computing information

Code
sessionInfo()
R version 4.4.1 (2024-06-14 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 10 x64 (build 19045)

Matrix products: default


locale:
[1] LC_COLLATE=English_Switzerland.utf8  LC_CTYPE=English_Switzerland.utf8   
[3] LC_MONETARY=English_Switzerland.utf8 LC_NUMERIC=C                        
[5] LC_TIME=English_Switzerland.utf8    

time zone: Europe/Zurich
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] testit_0.13   carData_3.0-5

loaded via a namespace (and not attached):
 [1] vctrs_0.6.5       cli_3.6.3         knitr_1.48        rlang_1.1.4      
 [5] xfun_0.48         generics_0.1.3    jsonlite_1.8.9    glue_1.8.0       
 [9] colorspace_2.1-1  htmltools_0.5.8.1 scales_1.3.0      fansi_1.0.6      
[13] rmarkdown_2.28    grid_4.4.1        munsell_0.5.1     evaluate_1.0.1   
[17] tibble_3.2.1      fastmap_1.2.0     yaml_2.3.10       lifecycle_1.0.4  
[21] compiler_4.4.1    ggsurvfit_1.1.0   dplyr_1.1.4       htmlwidgets_1.6.4
[25] pkgconfig_2.0.3   lattice_0.22-6    digest_0.6.37     R6_2.5.1         
[29] tidyselect_1.2.1  utf8_1.2.4        splines_4.4.1     pillar_1.9.0     
[33] magrittr_2.0.3    Matrix_1.7-0      gtable_0.3.5      tools_4.4.1      
[37] survival_3.7-0    ggplot2_3.5.1