Dynamic reporting with R and Quarto: reanalyzing the Titanic dataset

Statistical report

Author

Simon Schwab

Published

March 27, 2025

Abstract
An introduction to Quarto for participants in the Good Research Practice (GRP) course at the University of Zurich.

Objectives

We will learn Quarto by a practical example: a reanalysis of the Titanic data. Quarto enables you to put together content, executable code, tables, and figures into a finished document. To learn more about Quarto, see https://quarto.org.

We will focus on Quarto Documents, but you can also create other Quarto projects:

  • Presentations
  • Dashboards
  • Websites
  • Books
  • Manuscripts

Structure of this document

In the analysis below, I stick to the following structure. This structure can also be seen in the table of contents on the right.

flowchart TB
O("Objectives") --> D("Data import")
D --> P("Data processing")
P --> Q("Quality control")
Q --> DS("Descriptive statistics")
DS --> PA("Primary analysis")
PA --> C("Computing information")

Quarto supports flowcharts using Mermaid. For more details, see Diagrams with Quarto.

Here is a somewhat more detailed Data Analysis Workflow that I’m using in my own projects.

A final word of caution.

“A $100 analysis can make a $1,000,000 study worthless” –Frank Harrell

Objective of our data analysis

To identify variables associated with passenger survival in the 1912 Titanic incidence.

Data import

Usually this step involves reading an Excel file, data cleaning, merging different data sets, reshaping data, etc. Here, it is very simple. We load the data from the package carData. We also do quality control with the package testit.

If you don’t have the required packages already installed, you first need to install them.

install.packages("carData")
install.packages("testit")
Code
library(carData)
library(testit)

After loading the package, you can find the data in the object TitanicSurvival; I will rename it to data for simplicity. I show data for some randomly selected passengers in the table below.

Code
data = TitanicSurvival

set.seed(2024)
idx = sample(nrow(data), 6)
data[idx,]
survived sex age passengerClass
Richards, Master. George Sibley yes male 0.8333 2nd
Cacic, Mr. Luka no male 38.0000 3rd
Vendel, Mr. Olof Edvin no male 20.0000 3rd
Karaic, Mr. Milan no male 30.0000 3rd
Andersen, Mr. Albert Karvin no male 32.0000 3rd
Eustis, Miss. Elizabeth Mussey yes female 54.0000 1st
Exercise 1

Inspect the data, what variables are available and what type of data is it (continuous, binary, categorical, etc.)?

Data processing

This data set is very tidy and clean. We don’t need to do a lot of processing.

Histogram of age

However, it is important to get an overview of the data. I need to understand what the data looks like, for example, by plotting histograms for all continuous variables. In our data, only age is continuous.

Code
hist(data$age, main = "Distribution of age", xlab = "Age (years)",
     col = "#001E7C", border = "grey90")

Exercise 2

Replace the label “Frequency” with “Counts”.

Definition of outcome

Next, I define the primary outcome with a new variable called event. It is a binary outcome set to TRUE when the passenger survived and FALSE otherwise.

Code
data$event = data$survived == "yes"

Missing data

A very important step is to assess and address missing data; age has missing data. Thus, I report the number of missing values in age (%).

Code
count = sum(is.na(data$age))
prc = sum(is.na(data$age))/nrow(data)*100

tab = data.frame(age = sprintf("%d (%.1f%%)", count, prc))
tab
age
263 (20.1%)
Exercise 3

What are the implications of these missing data on the interpretation of our results?

Create analysis dataset

We will do a complete case analysis. Actually, this is not really recommended; however, in this tutorial, it is fit for purpose. The analysis data set is stored in the object titanic.

Code
idx = complete.cases(data)
titanic = data[idx,]
Exercise 4

How can missing data be addressed otherwise?

Quality control

Quality control is a crucial step and should be part of any data analysis pipeline. A test is like a question to the data that should be true, for example:

  • Do all passengers have a value in passengerClass (no missing)?
  • Are all the values in age larger than 0 and smaller than 100?

As soon as the logical expression within the function assert is not met, an error will be thrown.

Code
assert(!is.na(titanic$passengerClass))
Exercise 5

You can now extend the code block above. You could test the variable sex for completeness. You can try to test that the age range is between 0 and 100, for example.

You can test anything. Start by testing the obvious.

Descriptive statistics

I report some descriptives for sample size, age, and sex.

Code
q_age = quantile(titanic$age, probs = c(0.5, 0.25, 0.75))

tab = data.frame(
  
  Descriptives = c(
    sprintf("N=%d", nrow(titanic)),
    sprintf("%0.f (%0.f--%0.f)", q_age[1], q_age[2], q_age[3]),
    sprintf("%d (%0.f)", sum(titanic$sex == "female"), 
            sum(titanic$sex == "female")/nrow(titanic)*100), # calculate %
    sprintf("%d (%0.f)", sum(titanic$sex == "male"), 
            sum(titanic$sex == "male")/nrow(titanic)*100) # calculate %
  )
)

rownames(tab) = c("Sample Size",
                  "Age in years, median (IQR)",
                  "Sex: female (%)", "Sex: male (%)")
tab
Table 1: Descriptives of the Titanic passengers.
Descriptives
Sample Size N=1046
Age in years, median (IQR) 28 (21–39)
Sex: female (%) 388 (37)
Sex: male (%) 658 (63)
Exercise 6

You can now try to extend Table 1 and also report the number of passengers (and %) per passengerClass.

Primary analysis

I fit a logistic regression model to predict survival indicated in the variable event. The reference levels compare females to males and 1st/2nd class to 3rd class.

The model defaults to using the first factor level as the reference, but this can be changed.

Code
titanic$sex = factor(titanic$sex, levels = c("male", "female"))
titanic$passengerClass = factor(titanic$passengerClass, levels = c("3rd", "2nd", "1st"))

fit = glm(event ~ sex + age + passengerClass, data = titanic,  family = binomial)

tab = summary(fit)$coefficients
as.data.frame.matrix(tab)
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.2654312 0.2028675 -6.237722 0e+00
sexfemale 2.4978447 0.1660365 15.043949 0e+00
age -0.0343932 0.0063310 -5.432511 1e-07
passengerClass2nd 1.0090908 0.1983625 5.087104 4e-07
passengerClass1st 2.2896606 0.2258019 10.140129 0e+00

This table is not looking nice and is also hard to interpret. Therefore, I improve this table; see Table 2. That is how I like to report a logistic model in a paper.

Exercise 7

Run each line in the code block below one by one, and take time to understand what each step does.

Code
tab             = as.data.frame.matrix(tab)
tab$OR          = as.character(signif(exp(tab$Estimate), digits = 3))
tab$z.value.fmt = as.character(signif(tab$`z value`, digits = 3))
tab$pvalue      = ggsurvfit::format_p(tab$`Pr(>|z|)`)

# add 95% confidence interval
ci = confint(fit, level = 0.95)
lower = as.character(signif(exp(ci[,1]), digits = 2))
upper = as.character(signif(exp(ci[,2]), digits = 2))

tab$CI = sprintf("from %s to %s", lower, upper)

tab.tidy = tab[, c("OR", "CI", "z.value.fmt", "pvalue")]
colnames(tab.tidy) = c("Odds ratio", "95% CI", "z value", "p value")
tab.tidy
Table 2: Results from a logistic regression model.
Odds ratio 95% CI z value p value
(Intercept) 0.282 from 0.19 to 0.42 -6.24 <0.001
sexfemale 12.2 from 8.8 to 17 15 <0.001
age 0.966 from 0.95 to 0.98 -5.43 <0.001
passengerClass2nd 2.74 from 1.9 to 4.1 5.09 <0.001
passengerClass1st 9.87 from 6.4 to 15 10.1 <0.001

The row names could be improved, but I think we are done.

We found that on the Titanic, females, younger passengers, and those in 1st or 2nd class had a higher chance of survival.

Any suggestion to improve this analysis? Also, if you have seen any error, typo, or any other mishap in this document, please let me know. I try to improve it every year.

The lastest version of this document can be found at https://github.com/schw4b/titanic.

Computing information

Code
sessionInfo()
R version 4.4.3 (2025-02-28 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 10 x64 (build 19045)

Matrix products: default


locale:
[1] LC_COLLATE=English_Switzerland.utf8  LC_CTYPE=English_Switzerland.utf8   
[3] LC_MONETARY=English_Switzerland.utf8 LC_NUMERIC=C                        
[5] LC_TIME=English_Switzerland.utf8    

time zone: Europe/Zurich
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] testit_0.13   carData_3.0-5

loaded via a namespace (and not attached):
 [1] vctrs_0.6.5       cli_3.6.4         knitr_1.50        rlang_1.1.5      
 [5] xfun_0.51         generics_0.1.3    jsonlite_1.9.1    glue_1.8.0       
 [9] colorspace_2.1-1  htmltools_0.5.8.1 scales_1.3.0      rmarkdown_2.29   
[13] grid_4.4.3        munsell_0.5.1     evaluate_1.0.3    tibble_3.2.1     
[17] fastmap_1.2.0     yaml_2.3.10       lifecycle_1.0.4   compiler_4.4.3   
[21] ggsurvfit_1.1.0   dplyr_1.1.4       htmlwidgets_1.6.4 pkgconfig_2.0.3  
[25] lattice_0.22-6    digest_0.6.37     R6_2.6.1          tidyselect_1.2.1 
[29] splines_4.4.3     pillar_1.10.1     magrittr_2.0.3    Matrix_1.7-3     
[33] tools_4.4.3       gtable_0.3.6      survival_3.8-3    ggplot2_3.5.1