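Quick Summary Tables in R with the gtsummary Package
The gtsummary package summarizes data sets and regression models and outputs the results as customizable, publication-ready tables. This streamlines the analysis workflow and removes most manual table formatting.
Example data
The examples use the trial data set.
It contains baseline characteristics and tumor response to treatment for 200 patients who received Drug A or Drug B. Its structure is shown below.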
## Rows: 200
## Columns: 8
## $ trt <chr> "Drug A", "Drug B", "Drug A", "Drug A", "Drug A", "Drug B", "…
## $ age <dbl> 23, 9, 31, NA, 51, 39, 37, 32, 31, 34, 42, 63, 54, 21, 48, 71…
## $ marker <dbl> 0.160, 1.107, 0.277, 2.067, 2.767, 0.613, 0.354, 1.739, 0.144…
## $ stage <fct> T1, T2, T1, T3, T4, T4, T1, T1, T1, T3, T1, T3, T4, T4, T1, T…
## $ grade <fct> II, I, II, III, III, I, II, I, II, I, III, I, III, I, I, III,…
## $ response <int> 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0…
## $ death <int> 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0…
## $ ttdeath <dbl> 24.00, 24.00, 24.00, 17.64, 16.43, 15.64, 24.00, 18.43, 24.00…
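The listing above has the shape of dplyr::glimpse() output. A minimal setup sketch follows, assuming gtsummary and dplyr are installed; the library() calls are an assumption, since the original does not show how the session was prepared.

```r
# Assumption: both packages are already installed
library(gtsummary)
library(dplyr)

# trial ships with gtsummary; glimpse() prints one line per column
glimpse(trial)
```

All later examples assume gtsummary has been attached this way.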
Basic usage
tbl_summary() produces a table of summary statistics.
The example below builds a summary table from the trial data set.
trial |> tbl_summary(include = c(trt, age, grade))
Characteristic | N = 200¹ |
---|---|
Chemotherapy Treatment | |
Drug A | 98 (49%) |
Drug B | 102 (51%) |
Age | 47 (38, 57) |
Unknown | 11 |
Grade | |
I | 68 (34%) |
II | 68 (34%) |
III | 64 (32%) |
¹ n (%); Median (Q1, Q3)
Summarizing the trial data and comparing patient characteristics between the two treatment groups takes only a few lines of code.
trial |>
tbl_summary(by = trt, include = c(age, grade)) |>
add_p()
Characteristic | Drug A, N = 98¹ | Drug B, N = 102¹ | p-value² |
---|---|---|---|
Age | 46 (37, 60) | 48 (39, 56) | 0.7 |
Unknown | 7 | 4 | |
Grade | | | 0.9 |
I | 35 (36%) | 33 (32%) | |
II | 32 (33%) | 36 (35%) | |
III | 31 (32%) | 33 (32%) | |
¹ Median (Q1, Q3); n (%)
² Wilcoxon rank sum test; Pearson’s Chi-squared test
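Customizing the output
There are three main ways to customize a summary table:
- Use the arguments of tbl_summary().
- Add extra information to the table with the add_*() functions.
- Modify the table's appearance with other gtsummary functions.
Using the tbl_summary() arguments
tbl_summary() has many arguments for modifying the output.
Argument | Description |
---|---|
label | Variable labels printed in the table |
type | Variable type (e.g., continuous, categorical) |
statistic | Summary statistics to display |
digits | Number of decimal places the summary statistics are rounded to |
missing | Whether to display a row with the number of missing values |
missing_text | Text label for the missing-value row |
missing_stat | Statistic shown in the missing-value row |
sort | Sort the levels of categorical variables by frequency |
percent | Print column, row, or cell percentages |
include | Variables to include in the summary table |
Example: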
trial |>
tbl_summary(
by = trt,
include = c(age, grade),
statistic = list(
all_continuous() ~ "{mean} ({sd})",
all_categorical() ~ "{n} / {N} ({p}%)"
),
digits = all_continuous() ~ 2,
label = list(age ~ "Patient Age", grade ~ "Tumor Grade"),
missing_text = "(Missing)"
)
Characteristic | Drug A, N = 98¹ | Drug B, N = 102¹ |
---|---|---|
Patient Age | 47.01 (14.71) | 47.45 (14.01) |
(Missing) | 7 | 4 |
Tumor Grade | | |
I | 35 / 98 (36%) | 33 / 102 (32%) |
II | 32 / 98 (33%) | 36 / 102 (35%) |
III | 31 / 98 (32%) | 33 / 102 (32%) |
¹ Mean (SD); n / N (%)
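The example above does not exercise sort, percent, or missing. The sketch below shows one way they might be combined; the choice of variables and the "Missing" label are illustrative assumptions, not taken from the original.

```r
library(gtsummary)

tbl_sorted <- trial |>
  tbl_summary(
    by = trt,
    include = c(stage, age),
    sort = all_categorical() ~ "frequency", # most frequent level first
    percent = "row",                        # percentages add up across each row
    missing = "always",                     # show the missing row even when the count is 0
    missing_text = "Missing"
  )

tbl_sorted
```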
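There are several ways to specify the statistic= argument: a single formula, a list of formulas, or a named list. The table below shows equivalent ways to request the mean for the continuous variables age and marker. Any gtsummary argument that accepts formulas supports these forms.
Select with helpers | Select by variable name | Select with a named list |
---|---|---|
all_continuous() ~ "{mean}" | c("age", "marker") ~ "{mean}" | list(age = "{mean}", marker = "{mean}") |
list(all_continuous() ~ "{mean}") | c(age, marker) ~ "{mean}" | — |
— | list(c(age, marker) ~ "{mean}") | — |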
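As a quick check of the equivalence in the table, the sketch below builds the same table three ways and compares the formatted statistics. Reading the stat_0 column from table_body is an assumption about how the unstratified statistics are stored internally.

```r
library(gtsummary)

# Same request written with a helper, with quoted names, and as a named list
by_helper <- trial |>
  tbl_summary(include = c(age, marker), missing = "no",
              statistic = all_continuous() ~ "{mean}")
by_name <- trial |>
  tbl_summary(include = c(age, marker), missing = "no",
              statistic = c("age", "marker") ~ "{mean}")
by_list <- trial |>
  tbl_summary(include = c(age, marker), missing = "no",
              statistic = list(age = "{mean}", marker = "{mean}"))

# The formatted cells should match across all three tables
identical(by_helper$table_body$stat_0, by_name$table_body$stat_0)
identical(by_helper$table_body$stat_0, by_list$table_body$stat_0)
```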
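Using selection helpers
When selecting variables for a gtsummary argument, you can name the columns directly or use any tidyselect helper, such as starts_with(), contains(), and everything() (the helpers used in dplyr::select()).
For example, to report the mean and standard deviation for all continuous variables, use statistic = all_continuous() ~ "{mean} ({sd})".
To display age and marker with one decimal place, pass digits = c(age, marker) ~ 1. Quoted column names also work.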
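A small sketch of the selectors described above, used together in one call. The particular mix of variables is only an illustration.

```r
library(gtsummary)

tbl_helpers <- trial |>
  tbl_summary(
    include = c(age, marker, starts_with("t")),     # starts_with("t") picks trt and ttdeath
    statistic = all_continuous() ~ "{mean} ({sd})",
    digits = c("age", "marker") ~ 1,                # quoted names instead of bare ones
    missing = "no"
  )

tbl_helpers
```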
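Adding extra information to the table
The add_*() functions add information or statistics to a summary table.
Function | Description |
---|---|
add_p() | Adds p-values comparing values across groups |
add_overall() | Adds a column with overall summary statistics |
add_n() | Adds a column with N (or the number of missing values) for each variable |
add_difference() | Adds columns with the difference between two groups, its confidence interval, and p-value |
add_stat_label() | Adds a label for the summary statistics shown in each row |
add_stat() | Generic function that adds a column of user-defined values |
add_q() | Adds a column of q-values to control for multiple comparisons |
add_ci() | Adds confidence intervals |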
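Several of these functions can be chained on one table. The sketch below is one possible combination; the variables are chosen for illustration, and add_q() relies on its default adjustment method.

```r
library(gtsummary)

tbl_added <- trial |>
  tbl_summary(by = trt, include = c(age, grade, response), missing = "no") |>
  add_overall() |> # overall column alongside the two treatment columns
  add_n() |>       # number of non-missing observations per variable
  add_p() |>       # default tests chosen from the variable types
  add_q()          # p-values adjusted for multiple comparisons

tbl_added
```

add_q() needs the p-values to exist already, so it comes after add_p().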
Example: compare tumor response and marker level between the two chemotherapy treatments.
trial |>
tbl_summary(
by = trt,
include = c(response, marker),
statistic = list(
all_continuous() ~ "{mean} ({sd})",
all_categorical() ~ "{p}%"
),
missing = "no"
) |>
add_difference() |>
add_n() |>
modify_header(all_stat_cols() ~ "**{level}**")
Characteristic | N | Drug A¹ | Drug B¹ | Difference² | 95% CI² | p-value² |
---|---|---|---|---|---|---|
Tumor Response | 193 | 29% | 34% | -4.2% | -18%, 9.9% | 0.6 |
Marker Level (ng/mL) | 190 | 1.02 (0.89) | 0.82 (0.83) | 0.20 | -0.05, 0.44 | 0.12 |
Abbreviation: CI = Confidence Interval
¹ %; Mean (SD)
² 2-sample test for equality of proportions with continuity correction; Welch Two Sample t-test
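Paired t-test and McNemar's test. The data must be in long format, with one row per patient per treatment.
# Assume every patient received both Drug A and Drug B; prepare the data first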
trial_paired <-
trial |>
select(trt, marker, response) |>
mutate(.by = trt, id = dplyr::row_number()) |> # add an ID used to pair observations
tidyr::drop_na() |>
dplyr::filter(.by = id, dplyr::n() == 2)
glimpse(trial_paired)
## Rows: 166
## Columns: 4
## $ trt <chr> "Drug A", "Drug B", "Drug A", "Drug A", "Drug A", "Drug B", "…
## $ marker <dbl> 0.160, 1.107, 0.277, 2.067, 2.767, 0.613, 0.354, 1.739, 0.144…
## $ response <int> 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0…
## $ id <int> 1, 1, 2, 3, 4, 2, 5, 6, 7, 3, 4, 5, 6, 7, 8, 9, 8, 10, 9, 10,…
trial_paired |>
tbl_summary(by = trt, include = -id) |>
add_p(
test = list(marker ~ "paired.t.test",
response ~ "mcnemar.test"),
group = id
)
Characteristic | Drug A, N = 83¹ | Drug B, N = 83¹ | p-value² |
---|---|---|---|
Marker Level (ng/mL) | 0.82 (0.22, 1.71) | 0.53 (0.17, 1.31) | 0.2 |
Tumor Response | 21 (25%) | 28 (34%) | 0.3 |
¹ Median (Q1, Q3); n (%)
² Paired t-test; McNemar’s Chi-squared test with continuity correction
Adding 95% confidence intervals
trial |>
tbl_summary(
include = c(age, marker),
statistic = all_continuous() ~ "{mean} ({sd})",
missing = "no"
) |>
modify_header(stat_0 = "**Mean (SD)**") |>
remove_footnote_header(stat_0) |>
add_ci()
Characteristic | Mean (SD) | 95% CI |
---|---|---|
Age | 47 (14) | 45, 49 |
Marker Level (ng/mL) | 0.92 (0.86) | 0.79, 1.0 |
Abbreviation: CI = Confidence Interval |
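add_ci() also works for categorical variables, and its pattern argument can merge the interval into the statistic column instead of adding a separate one. A minimal sketch, assuming the same package setup as the example above:

```r
library(gtsummary)

tbl_ci <- trial |>
  tbl_summary(
    include = c(response, grade),
    statistic = all_categorical() ~ "{p}%",
    missing = "no"
  ) |>
  add_ci(pattern = "{stat} ({ci})") # show "percent (lower, upper)" in one column

tbl_ci
```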
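Modifying the table's appearance
gtsummary includes functions dedicated to modifying a table's appearance.
Function | Description |
---|---|
modify_header() | Update column headers |
modify_footnote_header() | Update column header footnotes |
modify_footnote_body() | Update table body footnotes |
modify_spanning_header() | Update spanning headers |
modify_caption() | Update the table caption/title |
bold_labels() | Bold variable labels |
bold_levels() | Bold variable levels |
italicize_labels() | Italicize variable labels |
italicize_levels() | Italicize variable levels |
bold_p() | Bold significant p-values |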
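The styling functions chain like any other gtsummary step. In the sketch below, the 0.10 threshold passed to bold_p() is an arbitrary value picked for illustration (the default is 0.05).

```r
library(gtsummary)

tbl_styled <- trial |>
  tbl_summary(by = trt, include = c(age, response)) |>
  add_p() |>
  bold_p(t = 0.10) |>  # bold p-values under the chosen threshold
  bold_labels() |>     # bold the variable labels
  italicize_levels()   # italicize the level rows

tbl_styled
```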
Use show_header_names() to see which columns can be modified.
trial |>
tbl_summary(by = trt, include = c(age, grade)) |>
add_p(pvalue_fun = label_style_pvalue(digits = 2)) |>
add_overall() |>
add_n() |>
show_header_names()
## Column Name Header level* N* n* p*
## label "**Characteristic**" 200 <int>
## n "**N**"
## stat_0 "**Overall** \nN = 200" Overall <chr> 200 <int> 200 <int> 1.00 <dbl>
## stat_1 "**Drug A** \nN = 98" Drug A <chr> 200 <int> 98 <int> 0.490 <dbl>
## stat_2 "**Drug B** \nN = 102" Drug B <chr> 200 <int> 102 <int> 0.510 <dbl>
## p.value "**p-value**" 200 <int>
## * These values may be dynamically placed into headers (and other locations).
## ℹ Review the `modify_header()` (`?gtsummary::modify_header()`) help for
## examples.
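The starred columns in this output (level, N, n, p) can be referenced in curly braces inside a header string. A sketch of how that might look; reading the result back from table_styling$header is an assumption about the object's internal layout.

```r
library(gtsummary)

tbl_header <- trial |>
  tbl_summary(by = trt, include = c(age, grade)) |>
  modify_header(label = "**Variable**") |>
  # {level}, {n} and {p} are filled in separately for each treatment column
  modify_header(all_stat_cols() ~ "**{level}**, N = {n} ({style_percent(p)}%)")

# Inspect the stored header text for the first treatment column
hdr <- tbl_header$table_styling$header
hdr$label[hdr$column == "stat_1"]
```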
Example:
trial |>
tbl_summary(by = trt, include = c(age, grade)) |>
add_p(pvalue_fun = label_style_pvalue(digits = 2)) |>
add_overall() |>
add_n() |>
modify_header(label ~ "**Variable**") |>
modify_spanning_header(c("stat_1", "stat_2") ~ "**Treatment Received**") |>
modify_footnote_header("Median (IQR) or Frequency (%)", columns = all_stat_cols()) |>
modify_caption("**Table 1. Patient Characteristics**") |>
bold_labels()
Variable | N | Overall, N = 200¹ | Treatment Received: Drug A, N = 98¹ | Treatment Received: Drug B, N = 102¹ | p-value² |
---|---|---|---|---|---|
Age | 189 | 47 (38, 57) | 46 (37, 60) | 48 (39, 56) | 0.72 |
Unknown | | 11 | 7 | 4 | |
Grade | 200 | | | | 0.87 |
I | | 68 (34%) | 35 (36%) | 33 (32%) | |
II | | 68 (34%) | 32 (33%) | 36 (35%) | |
III | | 64 (32%) | 31 (32%) | 33 (32%) | |
¹ Median (IQR) or Frequency (%)
² Wilcoxon rank sum test; Pearson’s Chi-squared test
Example: using the trial data, compare each grade group against a reference group (grade I here).
small_trial <- trial |> select(grade, age, response)
# Summary table
t0 <- small_trial |>
tbl_summary(by = grade, missing = "no") |>
modify_header(all_stat_cols() ~ "**{level}**")
# Compare grade I vs. II
t1 <- small_trial |>
dplyr::filter(grade %in% c("I", "II")) |>
tbl_summary(by = grade, missing = "no") |>
add_p() |>
modify_header(p.value ~ "**I vs. II**") |>
# Hide the summary statistic columns
modify_column_hide(all_stat_cols())
# Compare grade I vs. III
t2 <- small_trial |>
dplyr::filter(grade %in% c("I", "III")) |>
tbl_summary(by = grade, missing = "no") |>
add_p() |>
modify_header(p.value = "**I vs. III**") |>
modify_column_hide(all_stat_cols())
# Merge the results
tbl_merge(list(t0, t1, t2)) |>
modify_spanning_header(
all_stat_cols() ~ "**Tumor Grade**",
starts_with("p.value") ~ "**p-values**"
)
Characteristic | Tumor Grade: I¹ | Tumor Grade: II¹ | Tumor Grade: III¹ | p-values: I vs. II² | p-values: I vs. III² |
---|---|---|---|---|---|
Age | 47 (37, 56) | 49 (37, 57) | 47 (38, 58) | 0.7 | 0.5 |
Tumor Response | 21 (31%) | 19 (30%) | 21 (33%) | >0.9 | 0.9 |
¹ Median (Q1, Q3); n (%)
² Wilcoxon rank sum test; Fisher’s exact test
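Summarizing continuous variables
Continuous variables can also be summarized on multiple lines, a format that some journals use. To do so, set the summary type to "continuous2" (for summaries on two or more lines).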
trial |>
tbl_summary(
by = trt,
include = age,
type = all_continuous() ~ "continuous2",
statistic = all_continuous() ~ c(
"{N_nonmiss}",
"{median} ({p25}, {p75})",
"{min}, {max}"
),
missing = "no"
) |>
add_p(pvalue_fun = label_style_pvalue(digits = 2))
Characteristic | Drug A, N = 98 | Drug B, N = 102 | p-value¹ |
---|---|---|---|
Age | | | 0.72 |
N Non-missing | 91 | 98 | |
Median (Q1, Q3) | 46 (37, 60) | 48 (39, 56) | |
Min, Max | 6, 78 | 9, 83 | |
¹ Wilcoxon rank sum test
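tbl_continuous() summarizes a continuous variable by one, two, or more categorical variables. The example below summarizes a continuous variable by two categorical variables.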
trial |>
tbl_continuous(variable = marker, by = trt, include = grade) |>
modify_spanning_header(all_stat_cols() ~ "**Treatment Assignment**")
Characteristic | Treatment Assignment: Drug A, N = 98¹ | Treatment Assignment: Drug B, N = 102¹ |
---|---|---|
Grade | | |
I | 0.96 (0.23, 1.71) | 1.05 (0.28, 1.50) |
II | 0.66 (0.30, 1.24) | 0.21 (0.09, 1.08) |
III | 0.84 (0.16, 1.94) | 0.58 (0.33, 1.63) |
¹ Marker Level (ng/mL): Median (Q1, Q3)
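To stratify by additional categorical variables, wrap a table function such as tbl_summary() or tbl_continuous() in tbl_strata(), which builds a summary table stratified by more than one variable. The example below stratifies a tbl_summary() table by grade.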
trial |>
select(trt, grade, age, stage) |>
mutate(grade = paste("Grade", grade)) |>
tbl_strata(
strata = grade,
~ .x |>
tbl_summary(by = trt, missing = "no") |>
modify_header(all_stat_cols() ~ "**{level}**")
)
Characteristic | Grade I: Drug A¹ | Grade I: Drug B¹ | Grade II: Drug A¹ | Grade II: Drug B¹ | Grade III: Drug A¹ | Grade III: Drug B¹ |
---|---|---|---|---|---|---|
Age | 46 (36, 60) | 48 (42, 55) | 45 (31, 55) | 51 (42, 58) | 52 (42, 61) | 45 (36, 52) |
T Stage | | | | | | |
T1 | 8 (23%) | 9 (27%) | 14 (44%) | 9 (25%) | 6 (19%) | 7 (21%) |
T2 | 8 (23%) | 10 (30%) | 8 (25%) | 9 (25%) | 9 (29%) | 10 (30%) |
T3 | 11 (31%) | 7 (21%) | 5 (16%) | 6 (17%) | 6 (19%) | 8 (24%) |
T4 | 8 (23%) | 7 (21%) | 5 (16%) | 12 (33%) | 10 (32%) | 8 (24%) |
¹ Median (Q1, Q3); n (%)
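The same wrapper can in principle hold tbl_continuous(), which gives a continuous variable summarized by three categorical variables. The sketch below is an untested combination rather than an example from the gtsummary documentation; using stage as the stratum and the "Stage" prefix in .header are illustrative choices.

```r
library(gtsummary)

tbl_strat_cont <- trial |>
  tbl_strata(
    strata = stage,
    # one tbl_continuous() table per stage, merged side by side
    .tbl_fun = ~ .x |>
      tbl_continuous(variable = marker, by = trt, include = grade),
    .header = "**Stage {strata}**"
  )

tbl_strat_cont
```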
Cross tables
Use tbl_cross() to compare two categorical variables:
trial |>
tbl_cross(
row = stage,
col = trt,
percent = "cell"
) |>
add_p()
 | Chemotherapy Treatment: Drug A | Chemotherapy Treatment: Drug B | Total | p-value¹ |
---|---|---|---|---|
T Stage | | | | 0.9 |
T1 | 28 (14%) | 25 (13%) | 53 (27%) | |
T2 | 25 (13%) | 29 (15%) | 54 (27%) | |
T3 | 22 (11%) | 21 (11%) | 43 (22%) | |
T4 | 23 (12%) | 27 (14%) | 50 (25%) | |
Total | 98 (49%) | 102 (51%) | 200 (100%) | |
¹ Pearson’s Chi-squared test
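The percent argument changes the denominator. As a variation on the call above, the sketch below uses row percentages and a custom margin label; grade as the row variable and the "All" label are illustrative choices.

```r
library(gtsummary)

tbl_row_pct <- trial |>
  tbl_cross(
    row = grade,
    col = trt,
    percent = "row",    # percentages add up to 100% within each grade
    margin_text = "All" # label for the margin row and column
  )

tbl_row_pct
```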
The material above was collected and adapted from the gtsummary documentation.