Quick Summary Tables in R with the gtsummary Package

WindBlow

2023-01-18

R

R, R packages

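The gtsummary package summarizes data sets and regression models and presents the results as customizable, well-formatted tables, which streamlines the reporting step of a data analysis.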

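Example data

The examples use the trial data set.

It contains baseline characteristics of 200 patients treated with Drug A or Drug B, along with outcomes such as tumor response to treatment.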
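
The listing that follows has the shape of dplyr::glimpse() output. The post does not show the setup step, so the calls below are an assumption: a minimal sketch that attaches the two packages the later examples rely on and inspects the data.

```r
# Attach the packages used throughout the examples.
# dplyr supplies glimpse(), select() and mutate(); gtsummary supplies trial.
library(gtsummary)
library(dplyr)

# trial ships with gtsummary, so there is no file to read
glimpse(trial)
```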

## Rows: 200
## Columns: 8
## $ trt      <chr> "Drug A", "Drug B", "Drug A", "Drug A", "Drug A", "Drug B", "…
## $ age      <dbl> 23, 9, 31, NA, 51, 39, 37, 32, 31, 34, 42, 63, 54, 21, 48, 71…
## $ marker   <dbl> 0.160, 1.107, 0.277, 2.067, 2.767, 0.613, 0.354, 1.739, 0.144…
## $ stage    <fct> T1, T2, T1, T3, T4, T4, T1, T1, T1, T3, T1, T3, T4, T4, T1, T…
## $ grade    <fct> II, I, II, III, III, I, II, I, II, I, III, I, III, I, I, III,…
## $ response <int> 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0…
## $ death    <int> 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0…
## $ ttdeath  <dbl> 24.00, 24.00, 24.00, 17.64, 16.43, 15.64, 24.00, 18.43, 24.00…

Basic usage

tbl_summary() produces a table of summary statistics.

The call below builds one from the trial data set.

trial |> tbl_summary(include = c(trt, age, grade))

Characteristic	N = 200¹
Chemotherapy Treatment
Drug A	98 (49%)
Drug B	102 (51%)
Age	47 (38, 57)
Unknown	11
Grade
I	68 (34%)
II	68 (34%)
III	64 (32%)
¹ n (%); Median (Q1, Q3)

Splitting the summary by treatment group and adding p-values for the comparison takes only a few lines of code.

trial |>
  tbl_summary(by = trt, include = c(age, grade)) |>
  add_p()

Characteristic	Drug A N = 98¹	Drug B N = 102¹	p-value²
Age	46 (37, 60)	48 (39, 56)	0.7
Unknown	7	4
Grade			0.9
I	35 (36%)	33 (32%)
II	32 (33%)	36 (35%)
III	31 (32%)	33 (32%)
¹ Median (Q1, Q3); n (%)
² Wilcoxon rank sum test; Pearson’s Chi-squared test

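Customizing the output

There are three main ways to customize a summary table:

Use the customization arguments of tbl_summary().
Add extra information to the table with the add_*() functions.
Change the table's appearance with gtsummary's styling functions.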

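Using the tbl_summary() arguments

tbl_summary() has many arguments for modifying the table.

Argument	Description
label	Variable labels printed in the table
type	Variable type (e.g. continuous, categorical)
statistic	Summary statistics to display
digits	Number of decimal places the summary statistics are rounded to
missing	Whether to show a row with the number of missing values
missing_text	Text label for the missing-value row
missing_stat	Statistic shown in the missing-value row
sort	Sort the levels of categorical variables by frequency
percent	Print column, row, or cell percentages
include	Variables to include in the summary table

Example: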

trial |>
  tbl_summary(
    by = trt,
    include = c(age, grade),
    statistic = list(
      all_continuous() ~ "{mean} ({sd})",
      all_categorical() ~ "{n} / {N} ({p}%)"
    ),
    digits = all_continuous() ~ 2,
    label = list(age ~ "Patient Age", grade ~ "Tumor Grade"),
    missing_text = "(Missing)"
  )

Characteristic	Drug A N = 98¹	Drug B N = 102¹
Patient Age	47.01 (14.71)	47.45 (14.01)
(Missing)	7	4
Tumor Grade
I	35 / 98 (36%)	33 / 102 (32%)
II	32 / 98 (33%)	36 / 102 (35%)
III	31 / 98 (32%)	33 / 102 (32%)
¹ Mean (SD); n / N (%)

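The statistic= argument can be given as a single formula, a list of formulas, or a named list. The table below shows equivalent ways of requesting the mean for the continuous variables age and marker. Every gtsummary function argument that accepts a formula supports all of these forms.

Select with helpers	Select by variable name	Select with a named list
all_continuous() ~ "{mean}"	c("age", "marker") ~ "{mean}"	list(age = "{mean}", marker = "{mean}")
list(all_continuous() ~ "{mean}")	c(age, marker) ~ "{mean}"	-
-	list(c(age, marker) ~ "{mean}")	-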
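
As a quick check of that equivalence, the sketch below builds the same table three ways and compares the results. The object names t_helper, t_names, and t_list are illustrative, and as_tibble() is used here only to get plain data frames that can be compared.

```r
library(gtsummary)
library(dplyr)

# Selector helper on the left-hand side of a formula
t_helper <- trial |>
  tbl_summary(include = c(age, marker), statistic = all_continuous() ~ "{mean}")

# Bare variable names on the left-hand side of a formula
t_names <- trial |>
  tbl_summary(include = c(age, marker), statistic = c(age, marker) ~ "{mean}")

# Named list with one entry per variable
t_list <- trial |>
  tbl_summary(
    include = c(age, marker),
    statistic = list(age = "{mean}", marker = "{mean}")
  )

# If the three forms are equivalent, both comparisons should return TRUE
all.equal(as_tibble(t_helper), as_tibble(t_names))
all.equal(as_tibble(t_helper), as_tibble(t_list))
```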

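Using selector helpers

When choosing variables for a gtsummary argument you can type the column names directly, or use any of the tidyselect helpers that work in dplyr::select(), such as starts_with(), contains(), and everything().

For example, to report the mean and standard deviation for every continuous variable, use statistic = all_continuous() ~ "{mean} ({sd})".
To show the age and marker statistics with one decimal place, pass digits = c(age, marker) ~ 1; quoted column names are accepted as well.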
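
A minimal sketch that combines these selectors in one call. Choosing ttdeath through starts_with("tt") is only an illustration of a tidyselect helper, and tbl_helpers is a hypothetical object name.

```r
library(gtsummary)
library(dplyr)

tbl_helpers <- trial |>
  tbl_summary(
    # starts_with("tt") selects the ttdeath column
    include = c(age, marker, starts_with("tt")),
    # one statistic template for every continuous variable
    statistic = all_continuous() ~ "{mean} ({sd})",
    # bare names here; c("age", "marker") would select the same columns
    digits = c(age, marker) ~ 1,
    missing = "no"
  )

tbl_helpers
```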

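Adding extra information to the table

The add_*() functions add further information or statistics to a summary table.

Function	Description
add_p()	Adds p-values comparing values across groups
add_overall()	Adds a column with overall summary statistics
add_n()	Adds a column with N (or the number missing) for each variable
add_difference()	Adds columns for the difference between two groups, with confidence interval and p-value
add_stat_label()	Adds a label for the summary statistics shown in each row
add_stat()	Generic function for adding a column of user-defined values
add_q()	Adds a column of q-values to control for multiple comparisons
add_ci()	Adds confidence intervals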
Example: compare tumor response and marker level between the two chemotherapy treatments.

trial |> 
  tbl_summary(
    by = trt,
    include = c(response, marker),
    statistic = list(
      all_continuous() ~ "{mean} ({sd})",
      all_categorical() ~ "{p}%"
    ),
    missing = "no"
  ) |> 
  add_difference() |> 
  add_n() |> 
  modify_header(all_stat_cols() ~ "**{level}**")

Characteristic	N	Drug A¹	Drug B¹	Difference²	95% CI²	p-value²
Tumor Response	193	29%	34%	-4.2%	-18%, 9.9%	0.6
Marker Level (ng/mL)	190	1.02 (0.89)	0.82 (0.83)	0.20	-0.05, 0.44	0.12
Abbreviation: CI = Confidence Interval
¹ %; Mean (SD)
² 2-sample test for equality of proportions with continuity correction; Welch Two Sample t-test

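Paired t-tests and McNemar's test are also available. The data must be in long format.

# Pretend every patient received both Drug A and Drug B, and prepare the data first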
trial_paired <-
  trial |> 
  select(trt, marker, response) |> 
  mutate(.by = trt, id = dplyr::row_number()) |>  # add an id that pairs rows across treatments
  tidyr::drop_na() |> 
  dplyr::filter(.by = id, dplyr::n() == 2)
glimpse(trial_paired)

## Rows: 166
## Columns: 4
## $ trt      <chr> "Drug A", "Drug B", "Drug A", "Drug A", "Drug A", "Drug B", "…
## $ marker   <dbl> 0.160, 1.107, 0.277, 2.067, 2.767, 0.613, 0.354, 1.739, 0.144…
## $ response <int> 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0…
## $ id       <int> 1, 1, 2, 3, 4, 2, 5, 6, 7, 3, 4, 5, 6, 7, 8, 9, 8, 10, 9, 10,…

trial_paired |>   
  tbl_summary(by = trt, include = -id) |> 
  add_p(
    test = list(marker ~ "paired.t.test",
                response ~ "mcnemar.test"),
    group = id
  )

Characteristic	Drug A N = 83¹	Drug B N = 83¹	p-value²
Marker Level (ng/mL)	0.82 (0.22, 1.71)	0.53 (0.17, 1.31)	0.2
Tumor Response	21 (25%)	28 (34%)	0.3
¹ Median (Q1, Q3); n (%)
² Paired t-test; McNemar’s Chi-squared test with continuity correction

Adding a 95% confidence interval

trial |> 
  tbl_summary(
    include = c(age, marker),
    statistic = all_continuous() ~ "{mean} ({sd})", 
    missing = "no"
  ) |> 
  modify_header(stat_0 = "**Mean (SD)**") |> 
  remove_footnote_header(stat_0) |> 
  add_ci()

Characteristic	Mean (SD)	95% CI
Age	47 (14)	45, 49
Marker Level (ng/mL)	0.92 (0.86)	0.79, 1.0
Abbreviation: CI = Confidence Interval

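Changing the table's appearance

gtsummary includes functions dedicated to modifying how a table looks.

Function	Description
modify_header()	Update column headers
modify_footnote_header()	Update column header footnotes
modify_footnote_body()	Update table body footnotes
modify_spanning_header()	Update spanning headers
modify_caption()	Update the table caption/title
bold_labels()	Bold variable labels
bold_levels()	Bold variable levels
italicize_labels()	Italicize variable labels
italicize_levels()	Italicize variable levels
bold_p()	Bold significant p-values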
show_header_names() lists the columns whose headers can be edited.

trial |>
  tbl_summary(by = trt, include = c(age, grade)) |>
  add_p(pvalue_fun = label_style_pvalue(digits = 2)) |>
  add_overall() |>
  add_n() |>
  show_header_names()

## Column Name   Header                     level*          N*          n*          p*             
## label         "**Characteristic**"                       200 <int>                              
## n             "**N**"                                                                           
## stat_0        "**Overall**  \nN = 200"   Overall <chr>   200 <int>   200 <int>    1.00 <dbl>    
## stat_1        "**Drug A**  \nN = 98"      Drug A <chr>   200 <int>    98 <int>   0.490 <dbl>    
## stat_2        "**Drug B**  \nN = 102"     Drug B <chr>   200 <int>   102 <int>   0.510 <dbl>    
## p.value       "**p-value**"                              200 <int>

## * These values may be dynamically placed into headers (and other locations).
## ℹ Review the `modify_header()` (`?gtsummary::modify_header()`) help for
##   examples.

Example:

trial |>
  tbl_summary(by = trt, include = c(age, grade)) |>
  add_p(pvalue_fun = label_style_pvalue(digits = 2)) |>
  add_overall() |>
  add_n() |>
  modify_header(label ~ "**Variable**") |>
  modify_spanning_header(c("stat_1", "stat_2") ~ "**Treatment Received**") |>
  modify_footnote_header("Median (IQR) or Frequency (%)", columns = all_stat_cols()) |>
  modify_caption("**Table 1. Patient Characteristics**") |>
  bold_labels()

**Table 1. Patient Characteristics**
Variable	N	Overall N = 200¹	Treatment Received		p-value²
Variable	N	Overall N = 200¹	Drug A N = 98¹	Drug B N = 102¹	p-value²
Age	189	47 (38, 57)	46 (37, 60)	48 (39, 56)	0.72
Unknown		11	7	4
Grade	200				0.87
I		68 (34%)	35 (36%)	33 (32%)
II		68 (34%)	32 (33%)	36 (35%)
III		64 (32%)	31 (32%)	33 (32%)
¹ Median (IQR) or Frequency (%)
² Wilcoxon rank sum test; Pearson’s Chi-squared test

Example: using the trial data, compare each grade with a reference group (taken here to be grade I).

small_trial <- trial |> select(grade, age, response)

# Summary table for all three grades
t0 <- small_trial |> 
  tbl_summary(by = grade, missing = "no") |> 
  modify_header(all_stat_cols() ~ "**{level}**")

# Compare grade I with grade II
t1 <- small_trial |> 
  dplyr::filter(grade %in% c("I", "II")) |> 
  tbl_summary(by = grade, missing = "no") |> 
  add_p() |> 
  modify_header(p.value ~ "**I vs. II**") |> 
  # Hide the summary statistic columns, keeping only the p-value
  modify_column_hide(all_stat_cols())

# Compare grade I with grade III
t2 <- small_trial |> 
  dplyr::filter(grade %in% c("I", "III")) |> 
  tbl_summary(by = grade, missing = "no") |> 
  add_p() |> 
  modify_header(p.value ~ "**I vs. III**") |> 
  modify_column_hide(all_stat_cols())

# Merge the three tables
tbl_merge(list(t0, t1, t2)) |> 
  modify_spanning_header(
    all_stat_cols() ~ "**Tumor Grade**",
    starts_with("p.value") ~ "**p-values**"
  )

Characteristic	Tumor Grade			p-values
Characteristic	I¹	II¹	III¹	I vs. II²	I vs. III²
Age	47 (37, 56)	49 (37, 57)	47 (38, 58)	0.7	0.5
Tumor Response	21 (31%)	19 (30%)	21 (33%)	>0.9	0.9
¹ Median (Q1, Q3); n (%)
² Wilcoxon rank sum test; Fisher’s exact test

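Summarizing continuous variables

Continuous variables can also be summarized on multiple lines, a format that some journals commonly use. To switch a continuous variable to a multi-line summary, set its summary type to "continuous2" (for summaries spanning two or more lines).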

trial |>
  tbl_summary(
    by = trt,
    include = age,
    type = all_continuous() ~ "continuous2",
    statistic = all_continuous() ~ c(
      "{N_nonmiss}",
      "{median} ({p25}, {p75})",
      "{min}, {max}"
    ),
    missing = "no"
  ) |>
  add_p(pvalue_fun = label_style_pvalue(digits = 2))

Characteristic	Drug A N = 98	Drug B N = 102	p-value¹
Age			0.72
N Non-missing	91	98
Median (Q1, Q3)	46 (37, 60)	48 (39, 56)
Min, Max	6, 78	9, 83
¹ Wilcoxon rank sum test

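A continuous variable can also be summarized by one, two, or more categorical variables. The example below shows a table that summarizes a continuous variable by two categorical variables.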

trial |> 
  tbl_continuous(variable = marker, by = trt, include = grade) |> 
  modify_spanning_header(all_stat_cols() ~ "**Treatment Assignment**")

Characteristic	Treatment Assignment
Characteristic	Drug A N = 98¹	Drug B N = 102¹
Grade
I	0.96 (0.23, 1.71)	1.05 (0.28, 1.50)
II	0.66 (0.30, 1.24)	0.21 (0.09, 1.08)
III	0.84 (0.16, 1.94)	0.58 (0.33, 1.63)
¹ Marker Level (ng/mL): Median (Q1, Q3)

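To break a summary down by more than two categorical variables, wrap the table call in tbl_strata() to build a table stratified by several variables. The example below stratifies a tbl_summary() table by grade; tbl_continuous() can be combined with tbl_strata() in the same way.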

trial |> 
  select(trt, grade, age, stage) |> 
  mutate(grade = paste("Grade", grade)) |> 
  tbl_strata(
    strata = grade,
    ~ .x |> 
      tbl_summary(by = trt, missing = "no") |> 
      modify_header(all_stat_cols() ~ "**{level}**")
  )

Characteristic	Grade I		Grade II		Grade III
Characteristic	Drug A¹	Drug B¹	Drug A¹	Drug B¹	Drug A¹	Drug B¹
Age	46 (36, 60)	48 (42, 55)	45 (31, 55)	51 (42, 58)	52 (42, 61)	45 (36, 52)
T Stage
T1	8 (23%)	9 (27%)	14 (44%)	9 (25%)	6 (19%)	7 (21%)
T2	8 (23%)	10 (30%)	8 (25%)	9 (25%)	9 (29%)	10 (30%)
T3	11 (31%)	7 (21%)	5 (16%)	6 (17%)	6 (19%)	8 (24%)
T4	8 (23%)	7 (21%)	5 (16%)	12 (33%)	10 (32%)	8 (24%)
¹ Median (Q1, Q3); n (%)

Cross tables

Use tbl_cross() to compare two categorical variables in the data:

trial |>
  tbl_cross(
    row = stage,
    col = trt,
    percent = "cell"
  ) |>
  add_p()

	Chemotherapy Treatment		Total	p-value¹
	Drug A	Drug B	Total	p-value¹
T Stage				0.9
T1	28 (14%)	25 (13%)	53 (27%)
T2	25 (13%)	29 (15%)	54 (27%)
T3	22 (11%)	21 (11%)	43 (22%)
T4	23 (12%)	27 (14%)	50 (25%)
Total	98 (49%)	102 (51%)	200 (100%)
¹ Pearson’s Chi-squared test

The material above was collected and adapted from gtsummary.