A reanalysis of the Titanic data set in R and Quarto

Statistical report

Author

Simon Schwab

Published

November 1, 2024

Abstract
An introduction to Quarto for the participants in the Good Research Practice (GRP) course at the University of Zurich.

Objectives

We will learn Quarto by a practical example: a reanalysis of the Titanic data. Quarto enables you to put together content, executable code, tables, and figures into a finished document. To learn more about Quarto, see https://quarto.org.

In the analysis below, I stick to the following structure. This structure can also be seen in the table of contents on the right.

[Flowchart: Objectives → Data import → Data processing → Quality control → Descriptive statistics → Primary analysis → Computing information]

Here is a somewhat more detailed Data Analysis Workflow that I use in my own projects.

A final word of caution.

“A $100 analysis can make a $1,000,000 study worthless” –Frank Harrell

Primary goal

To identify factors associated with passenger survival in the 1912 Titanic disaster.

Data import

Usually this step involves reading an Excel file, data cleaning, merging different data sets, reshaping data, etc. Here, it is very simple. We load the data from the package carData. We also do quality control with the package testit.

If you don’t have the required packages already installed, you first need to install them.

install.packages("carData")
install.packages("testit")
Code
library(carData)
library(testit)

After loading the package, you can find the data in the object TitanicSurvival; I will rename it to data for simplicity. Inspect the data. What variables are available, and what type of data are they (continuous, binary, categorical, etc.)?

Code
data = TitanicSurvival
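
To inspect the data, here is a minimal sketch; str() and summary() are one possible starting point.

Code
# Sketch: inspect variable names, types, and summaries.
str(data)      # age is numeric; survived, sex, and passengerClass are factors
summary(data)  # per-variable summaries, including the number of NAs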

Now, I show a random set of passengers (rows in the data).

Code
set.seed(2024)
idx = sample(nrow(data), 6)
data[idx,]
survived sex age passengerClass
Richards, Master. George Sibley yes male 0.8333 2nd
Cacic, Mr. Luka no male 38.0000 3rd
Vendel, Mr. Olof Edvin no male 20.0000 3rd
Karaic, Mr. Milan no male 30.0000 3rd
Andersen, Mr. Albert Karvin no male 32.0000 3rd
Eustis, Miss. Elizabeth Mussey yes female 54.0000 1st

Data processing

This data set is very tidy and clean. We don’t need to do a lot of processing.

Histogram of age

However, it is important to get an overview of the data. I need to understand what the data look like, for example, by plotting histograms for all continuous variables.

Code
hist(data$age, main = "Distribution of age", xlab = "Age (years)",
     col = "#001E7C", border = "grey90")

I don’t like the y-axis label “Frequency”; change it to “Counts”.
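
One possible way to do this, as a sketch: pass ylab to hist().

Code
# Sketch: the same histogram with the y-axis label changed to "Counts".
hist(data$age, main = "Distribution of age", xlab = "Age (years)",
     ylab = "Counts", col = "#001E7C", border = "grey90")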

Definition of outcome

Next, I define the primary outcome with a new variable called event. It is a binary outcome set to TRUE when the passenger survived, and FALSE otherwise.

Code
data$event = data$survived == "yes"

Missing data

A very important step is to assess and address missing data. I found that age has missing data, so I report the number (and percentage) of missing values in age.

Code
count = sum(is.na(data$age))
prc = sum(is.na(data$age))/nrow(data)*100

tab = data.frame(age = sprintf("%d (%.1f%%)", count, prc))
tab
age
263 (20.1%)

Create analysis dataset

We will do a complete case analysis. This is generally not recommended; however, for this tutorial it is fit for purpose. The analysis data set is stored in the object titanic.

How else could missing data be addressed? One alternative is sketched after the code block below.

Code
idx = complete.cases(data)
titanic = data[idx,]
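
As an aside, one alternative to a complete case analysis is multiple imputation. Below is a minimal sketch using the mice package (assuming it is installed); it is not used in the rest of this analysis.

Code
# Sketch only (not used below): multiple imputation with mice.
library(mice)
imp = mice(data[, c("survived", "sex", "age", "passengerClass")],
           m = 5, seed = 2024, printFlag = FALSE)
fit.imp = with(imp, glm(survived == "yes" ~ sex + age + passengerClass,
                        family = binomial))
summary(pool(fit.imp))  # pooled estimates across the imputed data sets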

Quality control

Quality control is a crucial step and should be part of any data analysis pipeline. A test is an assertion about the data that should hold true, for example:

  • Do all passengers have a value in passengerClass (no missing)?
  • Are all the values in age larger than 0 and smaller than 100?

If the logical expression within the function assert is not met, an error is thrown.

Code
assert(!is.na(titanic$passengerClass))

You can now extend the code block above: test the other variable sex for completeness, check that age lies between 0 and 100, and so on. You can test anything (see the sketch below).

Test the obvious.
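
For example, the checks suggested above could look like this (a sketch using the same assert() function):

Code
# Sketch: additional checks suggested above.
assert(!is.na(titanic$sex))                  # sex is complete
assert(titanic$age > 0 & titanic$age < 100)  # plausible age range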

Descriptive statistics

I report some descriptives of the Titanic passengers, but I’m lazy and just report the sample size and age.

Code
q_age = quantile(titanic$age, probs = c(0.5, 0.25, 0.75))

tab = data.frame(
  
  Descriptives = c(
    sprintf("N=%d", nrow(titanic)),
    sprintf("%0.f (%0.f--%0.f)", q_age[1], q_age[2], q_age[3])
  )
)

rownames(tab) = c("Sample Size",
                  "Age, median (IQR)")
tab
Table 1: Descriptives of the Titanic passengers.
Descriptives
Sample Size N=1046
Age, median (IQR) 28 (21–39)

You can now try to extend Table 1 and also report the counts (and %) for the categorical variables sex and passengerClass.
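
A minimal sketch of how such counts (and %) could be computed, here for passengerClass:

Code
# Sketch: counts (and %) for a categorical variable, here passengerClass.
counts = table(titanic$passengerClass)
sprintf("%s: %d (%.1f%%)", names(counts), as.integer(counts),
        as.integer(counts)/nrow(titanic)*100)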

Primary analysis

We fit a logistic regression model to predict the binary outcome “survived: yes”. I also set the reference levels to my liking.

Code
titanic$sex = factor(titanic$sex, levels = c("male", "female"))
titanic$passengerClass = factor(titanic$passengerClass, levels = c("3rd", "2nd", "1st"))

fit = glm(event ~ sex + age + passengerClass, data = titanic,  family = binomial)

tab = summary(fit)$coefficients
as.data.frame.matrix(tab)
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.2654312 0.2028675 -6.237722 0e+00
sexfemale 2.4978447 0.1660365 15.043949 0e+00
age -0.0343932 0.0063310 -5.432511 1e-07
passengerClass2nd 1.0090908 0.1983625 5.087104 4e-07
passengerClass1st 2.2896606 0.2258019 10.140129 0e+00

However, this table does not look nice and is hard to interpret (I can explain why). Therefore, I improve it; see Table 2. That is how I like to report a logistic regression model.

You can execute the lines in the code block below one by one and try to understand what is going on.

Code
tab             = as.data.frame.matrix(tab)
tab$OR          = as.character(signif(exp(tab$Estimate), digits = 3))
tab$z.value.fmt = as.character(signif(tab$`z value`, digits = 3))
tab$pvalue      = ggsurvfit::format_p(tab$`Pr(>|z|)`)

# add 95% confidence interval
ci = confint(fit, level = 0.95)
lower = as.character(signif(exp(ci[,1]), digits = 2))
upper = as.character(signif(exp(ci[,2]), digits = 2))

tab$CI = sprintf("from %s to %s", lower, upper)

tab.tidy = tab[, c("OR", "CI", "z.value.fmt", "pvalue")]
colnames(tab.tidy) = c("Odds ratio", "95% CI", "z value", "p value")
tab.tidy
Table 2: Results from a logistic regression model.
Odds ratio 95% CI z value p value
(Intercept) 0.282 from 0.19 to 0.42 -6.24 <0.001
sexfemale 12.2 from 8.8 to 17 15 <0.001
age 0.966 from 0.95 to 0.98 -5.43 <0.001
passengerClass2nd 2.74 from 1.9 to 4.1 5.09 <0.001
passengerClass1st 9.87 from 6.4 to 15 10.1 <0.001

The row names can be improved. Try it.
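
One possible sketch (the labels are my own wording):

Code
# Sketch: more readable row labels for Table 2.
rownames(tab.tidy) = c("Intercept", "Sex: female (vs. male)", "Age (per year)",
                       "Class: 2nd (vs. 3rd)", "Class: 1st (vs. 3rd)")
tab.tidy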

Now we are done. We found that on the Titanic, the odds of survival were lower for third-class passengers, for males, and for older passengers, and lower still for passengers who were all of these (it is an additive model). I will discuss the table in more detail if you wish.
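
To illustrate what the additive model implies, here is a minimal sketch comparing predicted survival probabilities for two hypothetical passengers (the profiles are made up for illustration):

Code
# Sketch: predicted survival probabilities for two hypothetical passengers,
# a 30-year-old woman in 1st class and a 60-year-old man in 3rd class.
newdat = data.frame(
  sex            = factor(c("female", "male"), levels = c("male", "female")),
  age            = c(30, 60),
  passengerClass = factor(c("1st", "3rd"), levels = c("3rd", "2nd", "1st"))
)
predict(fit, newdata = newdat, type = "response")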

Any suggestions to improve this analysis? Also, if you spot any error, typo, or other mishap in this document, please let me know. I try to improve it every year.

Computing information

Code
sessionInfo()
R version 4.4.1 (2024-06-14 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 10 x64 (build 19045)

Matrix products: default


locale:
[1] LC_COLLATE=English_Switzerland.utf8  LC_CTYPE=English_Switzerland.utf8   
[3] LC_MONETARY=English_Switzerland.utf8 LC_NUMERIC=C                        
[5] LC_TIME=English_Switzerland.utf8    

time zone: Europe/Zurich
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] testit_0.13   carData_3.0-5

loaded via a namespace (and not attached):
 [1] vctrs_0.6.5       cli_3.6.3         knitr_1.48        rlang_1.1.4      
 [5] xfun_0.48         generics_0.1.3    jsonlite_1.8.9    glue_1.8.0       
 [9] colorspace_2.1-1  htmltools_0.5.8.1 scales_1.3.0      fansi_1.0.6      
[13] rmarkdown_2.28    grid_4.4.1        munsell_0.5.1     evaluate_1.0.1   
[17] tibble_3.2.1      fastmap_1.2.0     yaml_2.3.10       lifecycle_1.0.4  
[21] compiler_4.4.1    ggsurvfit_1.1.0   dplyr_1.1.4       htmlwidgets_1.6.4
[25] pkgconfig_2.0.3   lattice_0.22-6    digest_0.6.37     R6_2.5.1         
[29] tidyselect_1.2.1  utf8_1.2.4        splines_4.4.1     pillar_1.9.0     
[33] magrittr_2.0.3    Matrix_1.7-0      gtable_0.3.5      tools_4.4.1      
[37] survival_3.7-0    ggplot2_3.5.1