flowchart TB O("Objectives") --> D("Data import") D --> P("Data processing") P --> Q("Quality control") Q --> DS("Descriptive statistics") DS --> PA("Primary analysis") PA --> C("Computing information")
Dynamic reporting with R and Quarto: reanalyzing the Titanic dataset
Statistical report
Objectives
We will learn Quarto by a practical example: a reanalysis of the Titanic data. Quarto enables you to put together content, executable code, tables, and figures into a finished document. To learn more about Quarto, see https://quarto.org.
We will focus on Quarto Documents, but you can also create other Quarto projects:
- Presentations
- Dashboards
- Websites
- Books
- Manuscripts
Structure of this document
In the analysis below, I stick to the following structure. This structure can also be seen in the table of contents on the right.
Quarto supports flowcharts using Mermaid. For more details, see Diagrams with Quarto.
Here is a somewhat more detailed Data Analysis Workflow that I’m using in my own projects.
A final word of caution.
“A $100 analysis can make a $1,000,000 study worthless” –Frank Harrell
Objective of our data analysis
To identify variables associated with passenger survival in the 1912 Titanic incidence.
Data import
Usually this step involves reading an Excel file, data cleaning, merging different data sets, reshaping data, etc. Here, it is very simple. We load the data from the package carData
. We also do quality control with the package testit
.
If you don’t have the required packages already installed, you first need to install them.
install.packages("carData")
install.packages("testit")
Code
library(carData)
library(testit)
After loading the package, you can find the data in the object TitanicSurvival
; I will rename it to data
for simplicity. I show data for some randomly selected passengers in the table below.
Code
= TitanicSurvival
data
set.seed(2024)
= sample(nrow(data), 6)
idx data[idx,]
survived | sex | age | passengerClass | |
---|---|---|---|---|
Richards, Master. George Sibley | yes | male | 0.8333 | 2nd |
Cacic, Mr. Luka | no | male | 38.0000 | 3rd |
Vendel, Mr. Olof Edvin | no | male | 20.0000 | 3rd |
Karaic, Mr. Milan | no | male | 30.0000 | 3rd |
Andersen, Mr. Albert Karvin | no | male | 32.0000 | 3rd |
Eustis, Miss. Elizabeth Mussey | yes | female | 54.0000 | 1st |
Inspect the data, what variables are available and what type of data is it (continuous, binary, categorical, etc.)?
Data processing
This data set is very tidy and clean. We don’t need to do a lot of processing.
Histogram of age
However, it is important to get an overview of the data. I need to understand what the data looks like, for example, by plotting histograms for all continuous variables. In our data, only age is continuous.
Code
hist(data$age, main = "Distribution of age", xlab = "Age (years)",
col = "#001E7C", border = "grey90")
Replace the label “Frequency” with “Counts”.
Definition of outcome
Next, I define the primary outcome with a new variable called event
. It is a binary outcome set to TRUE
when the passenger survived and FALSE
otherwise.
Code
$event = data$survived == "yes" data
Missing data
A very important step is to assess and address missing data; age
has missing data. Thus, I report the number of missing values in age (%).
Code
= sum(is.na(data$age))
count = sum(is.na(data$age))/nrow(data)*100
prc
= data.frame(age = sprintf("%d (%.1f%%)", count, prc))
tab tab
age |
---|
263 (20.1%) |
What are the implications of these missing data on the interpretation of our results?
Create analysis dataset
We will do a complete case analysis. Actually, this is not really recommended; however, in this tutorial, it is fit for purpose. The analysis data set is stored in the object titanic
.
Code
= complete.cases(data)
idx = data[idx,] titanic
How can missing data be addressed otherwise?
Quality control
Quality control is a crucial step and should be part of any data analysis pipeline. A test is like a question to the data that should be true, for example:
- Do all passengers have a value in
passengerClass
(no missing)? - Are all the values in
age
larger than 0 and smaller than 100?
As soon as the logical expression within the function assert
is not met, an error will be thrown.
Code
assert(!is.na(titanic$passengerClass))
You can now extend the code block above. You could test the variable sex
for completeness. You can try to test that the age range is between 0 and 100, for example.
You can test anything. Start by testing the obvious.
Descriptive statistics
I report some descriptives for sample size, age, and sex.
Code
= quantile(titanic$age, probs = c(0.5, 0.25, 0.75))
q_age
= data.frame(
tab
Descriptives = c(
sprintf("N=%d", nrow(titanic)),
sprintf("%0.f (%0.f--%0.f)", q_age[1], q_age[2], q_age[3]),
sprintf("%d (%0.f)", sum(titanic$sex == "female"),
sum(titanic$sex == "female")/nrow(titanic)*100), # calculate %
sprintf("%d (%0.f)", sum(titanic$sex == "male"),
sum(titanic$sex == "male")/nrow(titanic)*100) # calculate %
)
)
rownames(tab) = c("Sample Size",
"Age in years, median (IQR)",
"Sex: female (%)", "Sex: male (%)")
tab
Descriptives | |
---|---|
Sample Size | N=1046 |
Age in years, median (IQR) | 28 (21–39) |
Sex: female (%) | 388 (37) |
Sex: male (%) | 658 (63) |
You can now try to extend Table 1 and also report the number of passengers (and %) per passengerClass
.
Primary analysis
I fit a logistic regression model to predict survival indicated in the variable event
. The reference levels compare females to males and 1st/2nd class to 3rd class.
The model defaults to using the first factor level as the reference, but this can be changed.
Code
$sex = factor(titanic$sex, levels = c("male", "female"))
titanic$passengerClass = factor(titanic$passengerClass, levels = c("3rd", "2nd", "1st"))
titanic
= glm(event ~ sex + age + passengerClass, data = titanic, family = binomial)
fit
= summary(fit)$coefficients
tab as.data.frame.matrix(tab)
Estimate | Std. Error | z value | Pr(>|z|) | |
---|---|---|---|---|
(Intercept) | -1.2654312 | 0.2028675 | -6.237722 | 0e+00 |
sexfemale | 2.4978447 | 0.1660365 | 15.043949 | 0e+00 |
age | -0.0343932 | 0.0063310 | -5.432511 | 1e-07 |
passengerClass2nd | 1.0090908 | 0.1983625 | 5.087104 | 4e-07 |
passengerClass1st | 2.2896606 | 0.2258019 | 10.140129 | 0e+00 |
This table is not looking nice and is also hard to interpret. Therefore, I improve this table; see Table 2. That is how I like to report a logistic model in a paper.
Run each line in the code block below one by one, and take time to understand what each step does.
Code
= as.data.frame.matrix(tab)
tab $OR = as.character(signif(exp(tab$Estimate), digits = 3))
tab$z.value.fmt = as.character(signif(tab$`z value`, digits = 3))
tab$pvalue = ggsurvfit::format_p(tab$`Pr(>|z|)`)
tab
# add 95% confidence interval
= confint(fit, level = 0.95)
ci = as.character(signif(exp(ci[,1]), digits = 2))
lower = as.character(signif(exp(ci[,2]), digits = 2))
upper
$CI = sprintf("from %s to %s", lower, upper)
tab
= tab[, c("OR", "CI", "z.value.fmt", "pvalue")]
tab.tidy colnames(tab.tidy) = c("Odds ratio", "95% CI", "z value", "p value")
tab.tidy
Odds ratio | 95% CI | z value | p value | |
---|---|---|---|---|
(Intercept) | 0.282 | from 0.19 to 0.42 | -6.24 | <0.001 |
sexfemale | 12.2 | from 8.8 to 17 | 15 | <0.001 |
age | 0.966 | from 0.95 to 0.98 | -5.43 | <0.001 |
passengerClass2nd | 2.74 | from 1.9 to 4.1 | 5.09 | <0.001 |
passengerClass1st | 9.87 | from 6.4 to 15 | 10.1 | <0.001 |
The row names could be improved, but I think we are done.
We found that on the Titanic, females, younger passengers, and those in 1st or 2nd class had a higher chance of survival.
Any suggestion to improve this analysis? Also, if you have seen any error, typo, or any other mishap in this document, please let me know. I try to improve it every year.
The lastest version of this document can be found at https://github.com/schw4b/titanic.
Computing information
Code
sessionInfo()
R version 4.4.3 (2025-02-28 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 10 x64 (build 19045)
Matrix products: default
locale:
[1] LC_COLLATE=English_Switzerland.utf8 LC_CTYPE=English_Switzerland.utf8
[3] LC_MONETARY=English_Switzerland.utf8 LC_NUMERIC=C
[5] LC_TIME=English_Switzerland.utf8
time zone: Europe/Zurich
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] testit_0.13 carData_3.0-5
loaded via a namespace (and not attached):
[1] vctrs_0.6.5 cli_3.6.4 knitr_1.50 rlang_1.1.5
[5] xfun_0.51 generics_0.1.3 jsonlite_1.9.1 glue_1.8.0
[9] colorspace_2.1-1 htmltools_0.5.8.1 scales_1.3.0 rmarkdown_2.29
[13] grid_4.4.3 munsell_0.5.1 evaluate_1.0.3 tibble_3.2.1
[17] fastmap_1.2.0 yaml_2.3.10 lifecycle_1.0.4 compiler_4.4.3
[21] ggsurvfit_1.1.0 dplyr_1.1.4 htmlwidgets_1.6.4 pkgconfig_2.0.3
[25] lattice_0.22-6 digest_0.6.37 R6_2.6.1 tidyselect_1.2.1
[29] splines_4.4.3 pillar_1.10.1 magrittr_2.0.3 Matrix_1.7-3
[33] tools_4.4.3 gtable_0.3.6 survival_3.8-3 ggplot2_3.5.1