
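Data Visualization with ggplot2

1. Introduction

In earlier lessons we covered R data structures, data wrangling with the tidyverse, control flow, and writing functions. This lesson shows how to build clean, publication-quality graphics with ggplot2.

ggplot2 is one of the core tidyverse packages and is based on Leland Wilkinson's The Grammar of Graphics. Its central idea is to break a plot into independent layers and to combine those layers into complex visualizations.

Goals for this lesson: learn the basic ggplot2 syntax, the common layer types, aesthetic mappings, faceting, and theme customization, and be able to draw plots that are common in bioinformatics (scatter plots, box plots, heatmaps, volcano plots, and so on).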


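2. The grammar of graphics in ggplot2

2.1 Core concepts

ggplot2 decomposes a plot into the following components:

  1. Data: the data frame to visualize
  2. Aesthetic mappings (aes): how variables map to visual properties (x, y, color, size, and so on)
  3. Geometric objects (geoms): the geometric representation of the data (points, lines, bars, and so on)
  4. Statistical transformations (stats): statistical summaries of the data (counts, smoothing, and so on)
  5. Scales: the details of each aesthetic mapping (color palettes, axis ranges, and so on)
  6. Coordinate systems: how data positions are projected onto the plane
  7. Facets: splitting the data into groups, each drawn in its own subplot
  8. Themes: the overall, non-data appearance of the plot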


2.2 Basic syntax

ggplot(data = <DATA>, mapping = aes(<MAPPINGS>)) +
  <GEOM_FUNCTION>() +
  <SCALE_FUNCTION>() +
  <FACET_FUNCTION>() +
  <THEME_FUNCTION>()

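Key points:

  • Layers are joined with + (not with the pipe operators %>% or |>)
  • ggplot() initializes the plot and sets the data and the global aesthetic mappings
  • Each geom_*() function adds one geometric layer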
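
The difference between the pipe and + can be illustrated with a short sketch: the pipe carries the data frame through the preparation steps and into ggplot(), and from that point on layers are added with +. The object name p_compact and the choice to keep only compact cars are illustrative assumptions, not part of the original lesson.

```r
library(ggplot2)
library(dplyr)

# Pipe the data-preparation steps, then switch to + for the plot layers
p_compact <- mpg %>%
  filter(class == "compact") %>%   # data step: keep compact cars only
  ggplot(aes(x = displ, y = hwy)) +  # the filtered data frame becomes `data`
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

p_compact
```

Replacing any + above with a pipe would raise an error, because a layer is added to a plot rather than receiving the plot as its first argument.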


3. A first ggplot2 plot

3.1 Installing and loading

# Install (if not already installed)
# install.packages("ggplot2")
# Or install the whole tidyverse
# install.packages("tidyverse")

library(ggplot2)
library(dplyr)  # data wrangling


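3.2 Scatter plot with a built-in data set

We use the mpg data set that ships with ggplot2 (fuel economy data for cars) as the example.

# Inspect the data
head(mpg)

# Basic scatter plot: engine displacement vs. highway fuel economy
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point()

How to read this:

  • ggplot(): initializes the plot object
  • aes(x = displ, y = hwy): maps displ to the x axis and hwy to the y axis
  • geom_point(): draws each observation as a point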


3.3 Adding a color mapping

# Color by vehicle class
ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point()

Points are colored by the class variable, and a legend is generated automatically.


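3.4 Adding size and transparency

# Map point size to the number of cylinders and add transparency
ggplot(mpg, aes(x = displ, y = hwy, color = class, size = cyl)) +
  geom_point(alpha = 0.6)

Notes:

  • alpha controls transparency (0 to 1); it can be mapped to a variable inside aes() or set as a fixed argument
  • When size is mapped to a numeric variable, point size varies with the value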
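
As a minimal sketch of the two ways to use alpha, the block below builds one plot with a fixed transparency and one where transparency is mapped to a variable. Using cty (city fuel economy, another column of mpg) as the mapped variable and the names p_fixed and p_mapped are illustrative choices.

```r
library(ggplot2)

# alpha as a fixed setting: every point gets the same transparency
p_fixed <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(alpha = 0.6)

# alpha as a mapping: transparency varies with city fuel economy (cty),
# and ggplot2 adds a legend for it
p_mapped <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(alpha = cty))

p_fixed
p_mapped
```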


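4. Common geometric objects (geoms)

4.1 Scatter plot (geom_point)

Used to show the relationship between two continuous variables.

# Bioinformatics example: gene expression scatter plot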
set.seed(123)
gene_expr <- data.frame(
  gene = paste0("Gene", 1:100),
  sample_A = rnorm(100, mean = 5, sd = 2),
  sample_B = rnorm(100, mean = 5, sd = 2),
  significant = sample(c("Sig", "Not Sig"), 100, replace = TRUE, prob = c(0.3, 0.7))
)

ggplot(gene_expr, aes(x = sample_A, y = sample_B, color = significant)) +
  geom_point(size = 2, alpha = 0.7) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "gray50") +
  labs(title = "Gene expression correlation",
       x = "Sample A (log2 expression)",
       y = "Sample B (log2 expression)") +
  theme_bw()


4.2 Line plot (geom_line)

Used for time series or other ordered data.

# Example: qPCR time series
time_series <- data.frame(
  time = rep(0:10, 3),
  expression = c(1 * 2^(0:10 * 0.3),  # Gene1
                 1 * 2^(0:10 * 0.5),  # Gene2
                 1 * 2^(0:10 * 0.2)), # Gene3
  gene = rep(c("Gene1", "Gene2", "Gene3"), each = 11)
)

ggplot(time_series, aes(x = time, y = expression, color = gene)) +
  geom_line(linewidth = 1) +
  geom_point(size = 2) +
  scale_y_log10() +  # logarithmic y axis
  labs(title = "Gene expression time series",
       x = "Time (hours)",
       y = "Expression (log scale)") +
  theme_minimal()


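4.3 Bar charts (geom_bar / geom_col)

  • geom_bar(): counts rows automatically; used for categorical variables
  • geom_col(): uses the y values as given; used for data that are already summarized

# geom_bar: automatic counting
ggplot(mpg, aes(x = class)) +
  geom_bar(fill = "steelblue") +
  labs(title = "Distribution of vehicle classes", x = "Class", y = "Count") +
  theme_classic()

# geom_col: precomputed values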
pathway_counts <- data.frame(
  pathway = c("MAPK", "PI3K-AKT", "WNT", "Notch", "TGF-beta"),
  gene_count = c(45, 38, 25, 18, 22)
)

ggplot(pathway_counts, aes(x = reorder(pathway, gene_count), y = gene_count)) +
  geom_col(fill = "coral") +
  coord_flip() +  # horizontal bars
  labs(title = "Genes per signaling pathway", x = NULL, y = "Number of genes") +
  theme_minimal()
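
To make the relationship between the two functions concrete, the sketch below counts the rows by hand with dplyr::count() and draws them with geom_col(), which should give the same bar heights as geom_bar() on the raw data. The names class_counts, n_cars, p_col, and p_bar are illustrative.

```r
library(ggplot2)
library(dplyr)

# Count rows per class by hand ...
class_counts <- mpg %>% count(class, name = "n_cars")

# ... then draw the precomputed counts with geom_col()
p_col <- ggplot(class_counts, aes(x = class, y = n_cars)) +
  geom_col(fill = "steelblue")

# geom_bar() performs the same count internally (its default stat is "count")
p_bar <- ggplot(mpg, aes(x = class)) +
  geom_bar(fill = "steelblue")

p_col
p_bar
```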


4.4 Box plot (geom_boxplot)

Shows the distribution of the data: median, quartiles, and outliers.

# Example: gene expression across tissues
tissue_expr <- data.frame(
  tissue = rep(c("Liver", "Brain", "Muscle", "Heart"), each = 50),
  expression = c(rnorm(50, 8, 2), rnorm(50, 6, 1.5), 
                 rnorm(50, 7, 2.5), rnorm(50, 9, 1.8))
)

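ggplot(tissue_expr, aes(x = tissue, y = expression, fill = tissue)) +
  geom_boxplot(alpha = 0.7, outlier.shape = NA) +  # hide outliers; the jitter layer already draws every point
  geom_jitter(width = 0.2, alpha = 0.3, size = 1) +  # overlay the raw data points
  labs(title = "Gene expression by tissue",
       x = "Tissue", y = "Expression (log2)") +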
  theme_bw() +
  theme(legend.position = "none")


4.5 Violin plot (geom_violin)

Combines the strengths of a box plot and a density plot.

ggplot(tissue_expr, aes(x = tissue, y = expression, fill = tissue)) +
  geom_violin(alpha = 0.7) +
  geom_boxplot(width = 0.1, fill = "white", outlier.shape = NA) +
  labs(title = "Gene expression distribution (violin plot)",
       x = "Tissue", y = "Expression (log2)") +
  theme_minimal() +
  theme(legend.position = "none")


4.6 Histogram (geom_histogram)

Shows the distribution of a single continuous variable.

ggplot(gene_expr, aes(x = sample_A)) +
  geom_histogram(bins = 20, fill = "skyblue", color = "black", alpha = 0.7) +
  geom_vline(xintercept = mean(gene_expr$sample_A), 
             linetype = "dashed", color = "red", linewidth = 1) +
  labs(title = "Gene expression distribution",
       x = "Expression (log2)", y = "Frequency") +
  theme_classic()


4.7 Density plot (geom_density)

A smoothed distribution curve.

ggplot(tissue_expr, aes(x = expression, fill = tissue)) +
  geom_density(alpha = 0.5) +
  labs(title = "Gene expression density by tissue",
       x = "Expression (log2)", y = "Density") +
  theme_minimal()


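5. Aesthetic mappings (aesthetics)

5.1 Global vs. local mappings

# Global mapping: shared by every layer
ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)  # inherits the color mapping, so one line is fitted per class

# Local mapping: applies to one layer only
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(color = class)) +  # only the points are colored
  geom_smooth(method = "lm", se = FALSE, color = "black")  # a single black trend line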
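
When the color mapping has to stay global (for example because several layers rely on it), one possible way to still get a single trend line is to override the grouping in the smoothing layer. The sketch below assumes that approach; p_trend is an illustrative name.

```r
library(ggplot2)

# color is mapped globally, so geom_smooth() would normally fit one line per class.
# aes(group = 1) puts all rows into a single group for this layer only,
# and the fixed color = "black" replaces the inherited color mapping.
p_trend <- ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point() +
  geom_smooth(aes(group = 1), method = "lm", se = FALSE, color = "black")

p_trend
```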


5.2 Fixed vs. mapped properties

# Wrong: a fixed value placed inside aes()
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(color = "blue"))  # wrong: "blue" is treated as a data value, not as a color

# Right: fixed values go outside aes()
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(color = "blue", size = 3)


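5.3 Common aesthetics

| Aesthetic | Description | Applicable geoms |
| --- | --- | --- |
| x, y | Position along the axes | All |
| color | Color of points and lines | point, line, text |
| fill | Fill color | bar, boxplot, violin, area |
| size | Size of points, lines, and text (for lines, linewidth is preferred since ggplot2 3.4.0) | point, line, text |
| alpha | Transparency (0 to 1) | All |
| shape | Point shape | point |
| linetype | Line type | line, smooth |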


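6. Facets

Faceting splits the data by a categorical variable into multiple subplots.

6.1 facet_wrap()

Facets by a single variable; the panels are arranged automatically.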

ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_wrap(~ class, nrow = 2) +
  labs(title = "Faceted by vehicle class") +
  theme_bw()


6.2 facet_grid()

Facets by the combination of two variables, forming a grid.

ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_grid(drv ~ cyl) +
  labs(title = "Faceted by drive type and number of cylinders") +
  theme_minimal()
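
By default all panels share the same axis ranges. As a small sketch of one common variation, the scales argument of facet_wrap() can let each panel choose its own y range, which may help when groups differ strongly in magnitude. The name p_free is illustrative.

```r
library(ggplot2)

# scales = "free_y" lets each panel pick its own y-axis range;
# the x axis stays shared so panels remain comparable horizontally
p_free <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_wrap(~ class, nrow = 2, scales = "free_y") +
  theme_bw()

p_free
```

Free scales make within-panel patterns easier to see but make heights harder to compare across panels, so the shared default is usually the safer starting point.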


7. Scales

Scales control how data values are mapped to visual properties.

7.1 Axis scales

# Logarithmic scale
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  scale_x_log10() +
  scale_y_continuous(breaks = seq(10, 50, by = 5)) +
  labs(title = "Logarithmic x axis") +
  theme_bw()


7.2 Color scales

# Specify colors manually
ggplot(tissue_expr, aes(x = tissue, y = expression, fill = tissue)) +
  geom_boxplot() +
  scale_fill_manual(values = c("Liver" = "#E64B35", 
                                "Brain" = "#4DBBD5",
                                "Muscle" = "#00A087",
                                "Heart" = "#F39B7F")) +
  theme_minimal()

# Use a ColorBrewer palette
ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point(size = 3) +
  scale_color_brewer(palette = "Set1") +
  theme_classic()


7.3 Gradient color scales

# Color gradient for a continuous variable
gene_heatmap_data <- expand.grid(
  gene = paste0("Gene", 1:20),
  sample = paste0("Sample", 1:10)
)
gene_heatmap_data$expression <- rnorm(200, mean = 5, sd = 2)

ggplot(gene_heatmap_data, aes(x = sample, y = gene, fill = expression)) +
  geom_tile() +
  scale_fill_gradient2(low = "blue", mid = "white", high = "red", 
                       midpoint = 5) +
  labs(title = "Gene expression heatmap") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))


8. Themes

8.1 Built-in themes

p <- ggplot(mpg, aes(x = class, fill = class)) +
  geom_bar() +
  labs(title = "Built-in theme examples")

# theme_gray() (the default)
p + theme_gray()

# theme_bw() (black and white; recommended for papers)
p + theme_bw()

# theme_minimal() (minimal)
p + theme_minimal()

# theme_classic() (classic, no grid lines)
p + theme_classic()


8.2 Customizing theme elements

ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point(size = 3) +
  labs(title = "Custom theme example",
       x = "Engine displacement (L)",
       y = "Highway fuel economy (mpg)") +
  theme_bw() +
  theme(
    plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
    axis.title = element_text(size = 12, face = "bold"),
    axis.text = element_text(size = 10),
    legend.position = "bottom",
    legend.title = element_text(size = 11, face = "bold"),
    panel.grid.minor = element_blank()
  )


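9. Worked bioinformatics examples

9.1 Volcano plot

# Simulate differential-expression data.
# Note: the p-values here are uniform random numbers, so after BH adjustment
# few if any genes pass padj < 0.05; real data would show significant points.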
set.seed(456)
n_genes <- 5000
volcano_data <- data.frame(
  gene = paste0("Gene", 1:n_genes),
  log2FC = rnorm(n_genes, mean = 0, sd = 1.5),
  pvalue = runif(n_genes, 0, 1)
)
volcano_data$padj <- p.adjust(volcano_data$pvalue, method = "BH")
volcano_data$neg_log10_padj <- -log10(volcano_data$padj)

# Classify: significantly up-regulated, down-regulated, or not significant
volcano_data <- volcano_data %>%
  mutate(
    diff_expressed = case_when(
      log2FC > 1 & padj < 0.05 ~ "Up-regulated",
      log2FC < -1 & padj < 0.05 ~ "Down-regulated",
      TRUE ~ "Not significant"
    )
  )

# Draw the volcano plot
ggplot(volcano_data, aes(x = log2FC, y = neg_log10_padj, color = diff_expressed)) +
  geom_point(alpha = 0.5, size = 1.5) +
  scale_color_manual(values = c("Up-regulated" = "red",
                                 "Down-regulated" = "blue",
                                 "Not significant" = "gray")) +
  geom_vline(xintercept = c(-1, 1), linetype = "dashed", color = "gray40") +
  geom_hline(yintercept = -log10(0.05), linetype = "dashed", color = "gray40") +
  labs(title = "Volcano Plot - differentially expressed genes",
       x = "Log2 Fold Change",
       y = "-Log10(Adjusted P-value)",
       color = "Status") +
  theme_bw() +
  theme(legend.position = "top")


9.2 MA plot

# Simulated data
ma_data <- data.frame(
  gene = paste0("Gene", 1:n_genes),
  baseMean = 10^runif(n_genes, 0, 4),
  log2FC = rnorm(n_genes, 0, 1.2)
)
ma_data$significant <- abs(ma_data$log2FC) > 1

ggplot(ma_data, aes(x = log10(baseMean), y = log2FC, color = significant)) +
  geom_point(alpha = 0.4, size = 1) +
  scale_color_manual(values = c("TRUE" = "red", "FALSE" = "gray50")) +
  geom_hline(yintercept = c(-1, 1), linetype = "dashed", color = "blue") +
  geom_hline(yintercept = 0, linetype = "solid", color = "black") +
  labs(title = "MA Plot",
       x = "Log10 Mean Expression",
       y = "Log2 Fold Change") +
  theme_minimal() +
  theme(legend.position = "none")


9.3 PCA plot (principal component analysis)

# Simulated PCA result
set.seed(789)
pca_data <- data.frame(
  sample = paste0("S", 1:30),
  PC1 = c(rnorm(15, -2, 1), rnorm(15, 2, 1)),
  PC2 = c(rnorm(15, 1, 1.5), rnorm(15, -1, 1.5)),
  group = rep(c("Control", "Treatment"), each = 15)
)

ggplot(pca_data, aes(x = PC1, y = PC2, color = group, shape = group)) +
  geom_point(size = 4, alpha = 0.8) +
  stat_ellipse(level = 0.95, linewidth = 1) +  # add a 95% ellipse for each group
  labs(title = "PCA Plot",
       x = "PC1 (45.3% variance)",
       y = "PC2 (23.7% variance)") +
  theme_bw() +
  theme(legend.position = "top",
        legend.title = element_blank())
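
The example above uses simulated scores and hard-coded variance percentages. With real data, the scores and the percentages would normally come from the PCA itself. The following is a minimal sketch using base R's prcomp() on the built-in iris data set, which stands in here for an expression matrix; the names pca_fit, var_pct, scores, and p_pca are illustrative.

```r
library(ggplot2)

# Run PCA on the four numeric columns of the built-in iris data set
pca_fit <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)

# Percentage of variance explained by each component
var_pct <- 100 * pca_fit$sdev^2 / sum(pca_fit$sdev^2)

# Sample scores on the first two components, plus the grouping variable
scores <- data.frame(pca_fit$x[, 1:2], group = iris$Species)

# Build the axis labels from the computed percentages instead of typing them in
p_pca <- ggplot(scores, aes(x = PC1, y = PC2, color = group)) +
  geom_point(size = 3, alpha = 0.8) +
  labs(x = sprintf("PC1 (%.1f%% variance)", var_pct[1]),
       y = sprintf("PC2 (%.1f%% variance)", var_pct[2])) +
  theme_bw()

p_pca
```

For an expression matrix with genes in rows and samples in columns, the matrix would typically be transposed before calling prcomp() so that each row is a sample.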


10. Combining plots

10.1 Using the patchwork package

# install.packages("patchwork")
library(patchwork)

p1 <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  labs(title = "Plot 1") +
  theme_bw()

p2 <- ggplot(mpg, aes(x = class, fill = class)) +
  geom_bar() +
  labs(title = "Plot 2") +
  theme_bw() +
  theme(legend.position = "none")

p3 <- ggplot(mpg, aes(x = hwy)) +
  geom_histogram(bins = 20, fill = "steelblue") +
  labs(title = "Plot 3") +
  theme_bw()

# Combined layout: p1 and p2 side by side, p3 underneath
(p1 | p2) / p3 +
  plot_annotation(title = "Combined plot example",
                  tag_levels = "A")


11. Saving plots

# Save the most recently displayed plot
ggsave("my_plot.png", width = 8, height = 6, dpi = 300)
ggsave("my_plot.pdf", width = 8, height = 6)

# Save a specific plot object
p <- ggplot(mpg, aes(x = displ, y = hwy)) + geom_point()
ggsave("scatter.png", plot = p, width = 10, height = 8, dpi = 300)
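
ggsave() interprets width and height in inches unless told otherwise. Journals often specify figure sizes in centimeters or millimeters, and the units argument handles that. The sketch below writes a PDF to a temporary directory; the file name, the 16 x 12 cm size, and the names p_save and out_file are illustrative assumptions.

```r
library(ggplot2)

p_save <- ggplot(mpg, aes(x = displ, y = hwy)) + geom_point()

# Write to a temporary directory so the example leaves no files behind
out_file <- file.path(tempdir(), "scatter_example.pdf")

# units can be "in" (default), "cm", "mm", or "px"
ggsave(out_file, plot = p_save, width = 16, height = 12, units = "cm")

file.exists(out_file)
```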


12. Best practices and tips

12.1 Preparing data

# Prepare data with the tidyverse
library(tidyr)

# Long format vs. wide format
wide_data <- data.frame(
  sample = c("S1", "S2", "S3"),
  GeneA = c(5.2, 6.1, 5.8),
  GeneB = c(7.3, 6.9, 7.5)
)

# Convert to long format for ggplot2
long_data <- wide_data %>%
  pivot_longer(cols = c(GeneA, GeneB), 
               names_to = "gene", 
               values_to = "expression")

ggplot(long_data, aes(x = sample, y = expression, fill = gene)) +
  geom_col(position = "dodge") +
  theme_minimal()


12.2 Organizing code

# Put each ggplot layer on its own line for easier reading and debugging
ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point(size = 3, alpha = 0.6) +
  scale_color_brewer(palette = "Set2") +
  labs(
    title = "Engine Displacement vs Highway MPG",
    subtitle = "Colored by vehicle class",
    x = "Engine Displacement (L)",
    y = "Highway Miles per Gallon",
    color = "Vehicle Class"
  ) +
  theme_bw() +
  theme(
    plot.title = element_text(size = 14, face = "bold"),
    legend.position = "bottom"
  )


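12.3 Common problems

Problem 1: displaying Chinese text

# macOS (on Linux, substitute a CJK font that is installed on your system)
theme(text = element_text(family = "STHeiti"))

# Windows
theme(text = element_text(family = "SimHei"))

Problem 2: overlapping axis labels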

ggplot(mpg, aes(x = manufacturer, y = hwy)) +
  geom_boxplot() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Problem 3: moving the legend

ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point() +
  theme(legend.position = "bottom")  # "top", "left", "right", "none"


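13. Recommended extension packages

  • ggpubr: publication-ready plots (with statistical tests)
  • ggsci: color palettes from scientific journals
  • ggrepel: automatically avoids overlapping text labels
  • patchwork: combines multiple plots
  • plotly: interactive plots (ggplotly())
  • gganimate: animated plots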


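14. Summary

This lesson covered the core concepts and usage of ggplot2:

  1. Grammar of graphics: data + aesthetic mappings + geometric objects + ...
  2. Common geoms: points, lines, bars, box plots, violin plots, heatmaps
  3. Aesthetic mappings: global vs. local, fixed vs. mapped
  4. Facets: facet_wrap() and facet_grid()
  5. Scales: axes, colors, gradients
  6. Themes: built-in themes and customization
  7. Bioinformatics examples: volcano plot, MA plot, PCA plot

Suggested next steps:

  • Read the official ggplot2 documentation
  • Browse the R Graph Gallery for inspiration
  • Practice plotting with real data sets
  • Explore the ecosystem of ggplot2 extension packages


References:

  • ggplot2: Elegant Graphics for Data Analysis
  • R for Data Science - Data Visualization
  • ggplot2 Cheatsheet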