Quickly Building Summary Tables in R with the gtsummary Package

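The gtsummary package summarizes data sets and regression models and presents the results in customizable, publication-ready tables. This streamlines the data-science workflow and saves considerable time on routine statistical reporting.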

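Example data

This article uses the trial data set that ships with gtsummary.

The data set contains baseline characteristics of 200 patients treated with Drug A or Drug B, along with outcomes such as tumor response to treatment.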

## Rows: 200
## Columns: 8
## $ trt      <chr> "Drug A", "Drug B", "Drug A", "Drug A", "Drug A", "Drug B", "…
## $ age      <dbl> 23, 9, 31, NA, 51, 39, 37, 32, 31, 34, 42, 63, 54, 21, 48, 71…
## $ marker   <dbl> 0.160, 1.107, 0.277, 2.067, 2.767, 0.613, 0.354, 1.739, 0.144…
## $ stage    <fct> T1, T2, T1, T3, T4, T4, T1, T1, T1, T3, T1, T3, T4, T4, T1, T…
## $ grade    <fct> II, I, II, III, III, I, II, I, II, I, III, I, III, I, I, III,…
## $ response <int> 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0…
## $ death    <int> 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0…
## $ ttdeath  <dbl> 24.00, 24.00, 24.00, 17.64, 16.43, 15.64, 24.00, 18.43, 24.00…
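
The listing above has the shape of dplyr::glimpse() output, but the article does not show the code that produced it. A minimal sketch for reproducing it, assuming the gtsummary and dplyr packages are installed (the library() calls are an assumption about the setup, not something stated in the article):

```r
# Assumed setup: the article does not show how the packages were loaded.
library(gtsummary)  # provides tbl_summary() and the trial data set
library(dplyr)      # provides glimpse()

# Print a compact, column-by-column overview of the example data
glimpse(trial)
```

The later examples can be run in the same session once these two packages are attached.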

Basic usage

tbl_summary() produces a table of summary statistics.

Create a summary table from the trial data set.

trial |> tbl_summary(include = c(trt, age, grade))

Characteristic              N = 200¹
Chemotherapy Treatment
    Drug A                  98 (49%)
    Drug B                  102 (51%)
Age                         47 (38, 57)
    Unknown                 11
Grade
    I                       68 (34%)
    II                      68 (34%)
    III                     64 (32%)
¹ n (%); Median (Q1, Q3)

Splitting the summary by treatment group and testing for differences between the groups takes only a few lines of code.

trial |>
  tbl_summary(by = trt, include = c(age, grade)) |>
  add_p()

Characteristic    Drug A, N = 98¹    Drug B, N = 102¹    p-value²
Age               46 (37, 60)        48 (39, 56)         0.7
    Unknown       7                  4
Grade                                                    0.9
    I             35 (36%)           33 (32%)
    II            32 (33%)           36 (35%)
    III           31 (32%)           33 (32%)
¹ Median (Q1, Q3); n (%)
² Wilcoxon rank sum test; Pearson’s Chi-squared test
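
The footnote lists the tests that add_p() selected by default. As a sketch of how they could be overridden, the test argument of add_p() accepts formulas that map variables to test names. The choice of a t-test for age and Fisher's exact test for grade is only an illustration, not a recommendation from the article:

```r
library(gtsummary)

# Replace the default tests: t.test for age, Fisher's exact test for grade
tbl_tests <- trial |>
  tbl_summary(by = trt, include = c(age, grade)) |>
  add_p(test = list(age ~ "t.test", grade ~ "fisher.test"))

tbl_tests
```

The table footnote should then name the tests that were actually applied, which makes it easy to confirm that the override took effect.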

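Customizing the output

There are three main ways to customize a summary table.

  • Use the customization arguments of tbl_summary().
  • Add extra information to the summary table with the add_*() functions.
  • Modify the table's appearance with functions from the gtsummary package.

Using the tbl_summary() arguments

tbl_summary() has many arguments for modifying the appearance of the table.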

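Argument        Description
label           Variable labels printed in the table
type            Variable type (e.g. continuous, categorical)
statistic       Summary statistics to display
digits          Number of decimal places the summary statistics are rounded to
missing         Whether to show a row with the number of missing values
missing_text    Text label for the missing-values row
missing_stat    Statistic shown in the missing-values row
sort            Sort the levels of categorical variables by frequency
percent         Print column, row, or cell percentages
include         Variables to include in the summary table

Example: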

trial |>
  tbl_summary(
    by = trt,
    include = c(age, grade),
    statistic = list(
      all_continuous() ~ "{mean} ({sd})",
      all_categorical() ~ "{n} / {N} ({p}%)"
    ),
    digits = all_continuous() ~ 2,
    label = list(age ~ "Patient Age", grade ~ "Tumor Grade"),
    missing_text = "(Missing)"
  )

Characteristic     Drug A, N = 98¹     Drug B, N = 102¹
Patient Age        47.01 (14.71)       47.45 (14.01)
    (Missing)      7                   4
Tumor Grade
    I              35 / 98 (36%)       33 / 102 (32%)
    II             32 / 98 (33%)       36 / 102 (35%)
    III            31 / 98 (32%)       33 / 102 (32%)
¹ Mean (SD); n / N (%)

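The statistic= argument can be specified in several ways: as a single formula, a list of formulas, or a named list. The table below shows equivalent ways of requesting the mean for the continuous variables age and marker. Every gtsummary function argument that accepts a formula supports these forms.

Select with helpers                  Select by variable name              Select with a named list
all_continuous() ~ "{mean}"          c("age", "marker") ~ "{mean}"        list(age = "{mean}", marker = "{mean}")
list(all_continuous() ~ "{mean}")    c(age, marker) ~ "{mean}"
                                     list(c(age, marker) ~ "{mean}")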
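
The equivalence of these forms can be checked directly. The sketch below builds the same table three ways and compares the displayed statistics. It reads the table_body element of the returned object, which is an internal structure, so treat that part as an assumption that may change between gtsummary versions:

```r
library(gtsummary)

# 1. Select with a helper
by_helper <- trial |>
  tbl_summary(include = c(age, marker), statistic = all_continuous() ~ "{mean}")

# 2. Select by (quoted) variable name
by_name <- trial |>
  tbl_summary(include = c(age, marker), statistic = c("age", "marker") ~ "{mean}")

# 3. Select with a named list
by_named_list <- trial |>
  tbl_summary(
    include = c(age, marker),
    statistic = list(age = "{mean}", marker = "{mean}")
  )

# The displayed statistics should match across the three tables
identical(by_helper$table_body$stat_0, by_name$table_body$stat_0)
identical(by_helper$table_body$stat_0, by_named_list$table_body$stat_0)
```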

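Using selection helpers

When selecting variables for a gtsummary argument, you can type the column names directly or use any tidyselect helper, such as starts_with(), contains(), and everything() (the same helpers used in dplyr::select()).

For example, to report the mean and standard deviation for all continuous variables, use statistic = all_continuous() ~ "{mean} ({sd})".
To show age and marker to one decimal place, pass digits = c(age, marker) ~ 1; quoted column names work as well.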
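
A minimal sketch that combines these selectors in one call. The use of starts_with("st") to pick up the stage column is purely illustrative and relies on stage being the only trial column whose name starts with "st":

```r
library(gtsummary)

tbl_helpers <- trial |>
  tbl_summary(
    # bare names and a tidyselect helper can be mixed inside c()
    include = c(age, marker, starts_with("st")),
    # gtsummary helper: applies to every continuous variable
    statistic = all_continuous() ~ "{mean} ({sd})",
    # one decimal place for age and marker
    digits = c(age, marker) ~ 1
  )

tbl_helpers
```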

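Adding extra information to the table

The add_*() functions add extra information or statistics to a summary table.

Function            Description
add_p()             Adds p-values comparing values between groups
add_overall()       Adds a column with overall summary statistics
add_n()             Adds a column with the sample size N (or the number of missing values) for each variable
add_difference()    Adds columns for the difference between two groups, with confidence interval and p-value
add_stat_label()    Adds a label for the summary statistics shown in each row
add_stat()          Generic function that adds a column of user-defined values
add_q()             Adds a column of q-values to control for multiple comparisons
add_ci()            Adds confidence intervals

Example: compare tumor response and marker level between the two chemotherapy treatments.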

trial |> 
  tbl_summary(
    by = trt,
    include = c(response, marker),
    statistic = list(
      all_continuous() ~ "{mean} ({sd})",
      all_categorical() ~ "{p}%"
    ),
    missing = "no"
  ) |> 
  add_difference() |> 
  add_n() |> 
  modify_header(all_stat_cols() ~ "**{level}**")

Characteristic          N      Drug A¹        Drug B¹        Difference²    95% CI²        p-value²
Tumor Response          193    29%            34%            -4.2%          -18%, 9.9%     0.6
Marker Level (ng/mL)    190    1.02 (0.89)    0.82 (0.83)    0.20           -0.05, 0.44    0.12
Abbreviation: CI = Confidence Interval
¹ %; Mean (SD)
² 2-sample test for equality of proportions with continuity correction; Welch Two Sample t-test

Paired t-tests and McNemar's tests are also supported. The data must first be converted to long format, with an id column that identifies the paired rows.

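# Pretend that every patient received both Drug A and Drug B, and prepare the data accordingly
trial_paired <-
  trial |> 
  select(trt, marker, response) |> 
  mutate(.by = trt, id = dplyr::row_number()) |>  # add an id that pairs rows across treatments
  tidyr::drop_na() |> 
  dplyr::filter(.by = id, dplyr::n() == 2)        # keep only ids with both observations
dplyr::glimpse(trial_paired)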
## Rows: 166
## Columns: 4
## $ trt      <chr> "Drug A", "Drug B", "Drug A", "Drug A", "Drug A", "Drug B", "…
## $ marker   <dbl> 0.160, 1.107, 0.277, 2.067, 2.767, 0.613, 0.354, 1.739, 0.144…
## $ response <int> 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0…
## $ id       <int> 1, 1, 2, 3, 4, 2, 5, 6, 7, 3, 4, 5, 6, 7, 8, 9, 8, 10, 9, 10,…
trial_paired |>   
  tbl_summary(by = trt, include = -id) |> 
  add_p(
    test = list(marker ~ "paired.t.test",
                response ~ "mcnemar.test"),
    group = id
  )

Characteristic          Drug A, N = 83¹      Drug B, N = 83¹      p-value²
Marker Level (ng/mL)    0.82 (0.22, 1.71)    0.53 (0.17, 1.31)    0.2
Tumor Response          21 (25%)             28 (34%)             0.3
¹ Median (Q1, Q3); n (%)
² Paired t-test; McNemar’s Chi-squared test with continuity correction

Adding 95% confidence intervals

trial |> 
  tbl_summary(
    include = c(age, marker),
    statistic = all_continuous() ~ "{mean} ({sd})", 
    missing = "no"
  ) |> 
  modify_header(stat_0 = "**Mean (SD)**") |> 
  remove_footnote_header(stat_0) |> 
  add_ci()

Characteristic          Mean (SD)      95% CI
Age                     47 (14)        45, 49
Marker Level (ng/mL)    0.92 (0.86)    0.79, 1.0
Abbreviation: CI = Confidence Interval

Modifying the table's appearance

gtsummary includes functions dedicated to modifying the appearance of a table.

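Function                    Description
modify_header()             Updates column headers
modify_footnote_header()    Updates column header footnotes
modify_footnote_body()      Updates table body footnotes
modify_spanning_header()    Updates spanning headers
modify_caption()            Updates the table caption/title
bold_labels()               Bolds variable labels
bold_levels()               Bolds variable levels
italicize_labels()          Italicizes variable labels
italicize_levels()          Italicizes variable levels
bold_p()                    Bolds significant p-values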
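
A short sketch of the text-styling functions from this list applied to one table. The threshold t = 0.10 passed to bold_p() is an arbitrary illustration, and the final line inspects an internal element of the object (table_styling), which is an assumption about the current object layout rather than a documented interface:

```r
library(gtsummary)

tbl_styled <- trial |>
  tbl_summary(by = trt, include = c(age, marker, grade), missing = "no") |>
  add_p() |>
  bold_labels() |>       # bold the variable labels (Age, Marker Level, Grade)
  italicize_levels() |>  # italicize the level rows (I, II, III)
  bold_p(t = 0.10)       # bold p-values at or below the illustrative 0.10 threshold

tbl_styled

# Peek at the formatting instructions recorded on the object (internal structure)
unique(tbl_styled$table_styling$text_format$format_type)
```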

Use show_header_names() to see which column headers can be modified.

trial |>
  tbl_summary(by = trt, include = c(age, grade)) |>
  add_p(pvalue_fun = label_style_pvalue(digits = 2)) |>
  add_overall() |>
  add_n() |>
  show_header_names()
## Column Name   Header                     level*          N*          n*          p*             
## label         "**Characteristic**"                       200 <int>                              
## n             "**N**"                                                                           
## stat_0        "**Overall**  \nN = 200"   Overall <chr>   200 <int>   200 <int>    1.00 <dbl>    
## stat_1        "**Drug A**  \nN = 98"      Drug A <chr>   200 <int>    98 <int>   0.490 <dbl>    
## stat_2        "**Drug B**  \nN = 102"     Drug B <chr>   200 <int>   102 <int>   0.510 <dbl>    
## p.value       "**p-value**"                              200 <int>

## * These values may be dynamically placed into headers (and other locations).
## ℹ Review the `modify_header()` (`?gtsummary::modify_header()`) help for
##   examples.
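
The starred values in this listing (level, N, n, p) can be inserted into a header with curly braces. A minimal sketch, which reads the resulting header text from the internal table_styling element (an assumption about the object layout, used here only to show the result without rendering the table):

```r
library(gtsummary)

tbl_hdr <- trial |>
  tbl_summary(by = trt, include = c(age, grade)) |>
  # {level}, {n} and {p} are filled in separately for each statistic column
  modify_header(all_stat_cols() ~ "**{level}**, N = {n} ({style_percent(p)}%)")

# Inspect the header text that was generated for the two treatment columns
hdr <- tbl_hdr$table_styling$header
hdr$label[hdr$column %in% c("stat_1", "stat_2")]
```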

Example:

trial |>
  tbl_summary(by = trt, include = c(age, grade)) |>
  add_p(pvalue_fun = label_style_pvalue(digits = 2)) |>
  add_overall() |>
  add_n() |>
  modify_header(label ~ "**Variable**") |>
  modify_spanning_header(c("stat_1", "stat_2") ~ "**Treatment Received**") |>
  modify_footnote_header("Median (IQR) or Frequency (%)", columns = all_stat_cols()) |>
  modify_caption("**Table 1. Patient Characteristics**") |>
  bold_labels()

Table 1. Patient Characteristics
                                          Treatment Received
Variable      N      Overall, N = 200¹    Drug A, N = 98¹    Drug B, N = 102¹    p-value²
Age           189    47 (38, 57)          46 (37, 60)        48 (39, 56)         0.72
    Unknown          11                   7                  4
Grade         200                                                                0.87
    I                68 (34%)             35 (36%)           33 (32%)
    II               68 (34%)             32 (33%)           36 (35%)
    III              64 (32%)             31 (32%)           33 (32%)
¹ Median (IQR) or Frequency (%)
² Wilcoxon rank sum test; Pearson’s Chi-squared test

Example: using the trial data, compare each grade with a reference group (here, grade I).

small_trial <- trial |> select(grade, age, response)

# Summary table for all three grades
t0 <- small_trial |> 
  tbl_summary(by = grade, missing = "no") |> 
  modify_header(all_stat_cols() ~ "**{level}**")

# Compare grade I with grade II
t1 <- small_trial |> 
  dplyr::filter(grade %in% c("I", "II")) |> 
  tbl_summary(by = grade, missing = "no") |> 
  add_p() |> 
  modify_header(p.value ~ "**I vs. II**") |> 
  # hide the summary statistic columns, keeping only the p-value
  modify_column_hide(all_stat_cols())

# Compare grade I with grade III
t2 <- small_trial |> 
  dplyr::filter(grade %in% c("I", "III")) |> 
  tbl_summary(by = grade, missing = "no") |> 
  add_p() |> 
  modify_header(p.value = "**I vs. III**") |> 
  modify_column_hide(all_stat_cols())

# Merge the three tables
tbl_merge(list(t0, t1, t2)) |> 
  modify_spanning_header(
    all_stat_cols() ~ "**Tumor Grade**",
    starts_with("p.value") ~ "**p-values**"
  )

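                  Tumor Grade                                  p-values
Characteristic    I¹             II¹            III¹           I vs. II²    I vs. III²
Age               47 (37, 56)    49 (37, 57)    47 (38, 58)    0.7          0.5
Tumor Response    21 (31%)       19 (30%)       21 (33%)       >0.9         0.9
¹ Median (Q1, Q3); n (%)
² Wilcoxon rank sum test; Fisher’s exact test

Summarizing continuous variables

Continuous variables can also be summarized on multiple lines, a format that some journals prefer. To switch a continuous variable to a multi-line summary, set its summary type to "continuous2" (for summaries of two or more lines).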

trial |>
  tbl_summary(
    by = trt,
    include = age,
    type = all_continuous() ~ "continuous2",
    statistic = all_continuous() ~ c(
      "{N_nonmiss}",
      "{median} ({p25}, {p75})",
      "{min}, {max}"
    ),
    missing = "no"
  ) |>
  add_p(pvalue_fun = label_style_pvalue(digits = 2))

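Characteristic         Drug A, N = 98    Drug B, N = 102    p-value¹
Age                                                         0.72
    N Non-missing      91                98
    Median (Q1, Q3)    46 (37, 60)       48 (39, 56)
    Min, Max           6, 78             9, 83
¹ Wilcoxon rank sum test

tbl_continuous() summarizes a continuous variable by one, two, or more categorical variables. The example below summarizes a continuous variable by two categorical variables.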

trial |> 
  tbl_continuous(variable = marker, by = trt, include = grade) |> 
  modify_spanning_header(all_stat_cols() ~ "**Treatment Assignment**")

                  Treatment Assignment
Characteristic    Drug A, N = 98¹      Drug B, N = 102¹
Grade
    I             0.96 (0.23, 1.71)    1.05 (0.28, 1.50)
    II            0.66 (0.30, 1.24)    0.21 (0.09, 1.08)
    III           0.84 (0.16, 1.94)    0.58 (0.33, 1.63)
¹ Marker Level (ng/mL): Median (Q1, Q3)

To summarize by more than two categorical variables, combine tbl_strata() with tbl_continuous() or, as in the example below, with tbl_summary() to build a table stratified by several variables.

trial |> 
  select(trt, grade, age, stage) |> 
  mutate(grade = paste("Grade", grade)) |> 
  tbl_strata(
    strata = grade,
    ~ .x |> 
      tbl_summary(by = trt, missing = "no") |> 
      modify_header(all_stat_cols() ~ "**{level}**")
  )

                  Grade I                       Grade II                      Grade III
Characteristic    Drug A¹        Drug B¹        Drug A¹        Drug B¹        Drug A¹        Drug B¹
Age               46 (36, 60)    48 (42, 55)    45 (31, 55)    51 (42, 58)    52 (42, 61)    45 (36, 52)
T Stage
    T1            8 (23%)        9 (27%)        14 (44%)       9 (25%)        6 (19%)        7 (21%)
    T2            8 (23%)        10 (30%)       8 (25%)        9 (25%)        9 (29%)        10 (30%)
    T3            11 (31%)       7 (21%)        5 (16%)        6 (17%)        6 (19%)        8 (24%)
    T4            8 (23%)        7 (21%)        5 (16%)        12 (33%)       10 (32%)       8 (24%)
¹ Median (Q1, Q3); n (%)

Cross tabulation

Use tbl_cross() to compare two categorical variables in the data:
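
The text mentions tbl_continuous() together with tbl_strata(), while the example above stratifies a tbl_summary() table. A minimal sketch of the tbl_continuous() variant, summarizing marker by grade and treatment within each tumor stage. The choice of stage as the stratifying variable is an assumption made for illustration:

```r
library(gtsummary)

tbl_by_stage <- trial |>
  tbl_strata(
    strata = stage,
    # build one tbl_continuous() table per stage, then merge them side by side
    ~ .x |>
      tbl_continuous(variable = marker, by = trt, include = grade) |>
      modify_header(all_stat_cols() ~ "**{level}**")
  )

tbl_by_stage
```

Each stratum holds only about a quarter of the 200 patients, so individual cells rest on few observations and the medians should be read with that in mind.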

trial |>
  tbl_cross(
    row = stage,
    col = trt,
    percent = "cell"
  ) |>
  add_p()

               Chemotherapy Treatment
               Drug A       Drug B        Total         p-value¹
T Stage                                                 0.9
    T1         28 (14%)     25 (13%)      53 (27%)
    T2         25 (13%)     29 (15%)      54 (27%)
    T3         22 (11%)     21 (11%)      43 (22%)
    T4         23 (12%)     27 (14%)      50 (25%)
Total          98 (49%)     102 (51%)     200 (100%)
¹ Pearson’s Chi-squared test

The material above was collected and adapted from the gtsummary documentation.
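
The percent argument also accepts "row" and "column". As a sketch, row percentages show how patients within each T stage are split between the two treatments, which is often easier to read than cell percentages. Which view is most useful depends on the question being asked:

```r
library(gtsummary)

tbl_row_pct <- trial |>
  tbl_cross(
    row = stage,
    col = trt,
    percent = "row"  # percentages add up to 100% across each row
  ) |>
  add_p()

tbl_row_pct
```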