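Data Visualization with ggplot2
1. Introduction
In the previous lessons we covered R's data structures, data wrangling with the tidyverse, control flow, and writing functions. This lesson shows how to build clean, publication-quality graphics with ggplot2.
ggplot2 is one of the core packages of the tidyverse and is based on Leland Wilkinson's The Grammar of Graphics. Its central idea is to decompose a graphic into independent layers and to build complex visualizations by combining those layers.
Goal of this section: learn ggplot2's basic syntax, common layer types, aesthetic mappings, faceting, and theme customization, and be able to draw the plots that are common in bioinformatics (scatter plots, box plots, heatmaps, volcano plots, and so on).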
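2. The Grammar of Graphics in ggplot2
2.1 Core concepts
ggplot2 decomposes a graphic into the following components:
- Data: the data frame to visualize
- Aesthetic mappings (aes): mappings from data variables to visual properties (x, y, color, size, and so on)
- Geometric objects (geoms): the geometric representation of the data (points, lines, bars, and so on)
- Statistical transformations (stats): statistical summaries of the data (counts, smoothing, and so on)
- Scales: control the details of each aesthetic mapping (color palettes, axis ranges, and so on)
- Coordinate systems: how data positions are mapped onto the plane
- Facets: split the data into groups and draw one subplot per group
- Themes: control the overall appearance of the plot
2.2 Basic syntax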
ggplot(data = <DATA>, mapping = aes(<MAPPINGS>)) +
  <GEOM_FUNCTION>() +
  <SCALE_FUNCTION>() +
  <FACET_FUNCTION>() +
  <THEME_FUNCTION>()
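Key points:
- Layers are joined with + (not with the pipe operators %>% or |>)
- ggplot() initializes the plot and specifies the data and the global aesthetic mappings
- Each geom_*() function adds one geometric layer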
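To make the layering idea concrete, the sketch below builds a plot step by step and stores each stage in a variable. It assumes the mpg dataset that ships with ggplot2; the names p_base, p_points, and p_full are illustrative only.

```r
library(ggplot2)

# Step 1: data and global aesthetic mappings only; this draws empty axes
p_base <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy))

# Step 2: add a point layer
p_points <- p_base + geom_point()

# Step 3: add a second layer (a linear trend line) and a theme
p_full <- p_points +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  theme_bw()

# Each geom_*() call appends one entry to the plot's list of layers
length(p_base$layers)
length(p_full$layers)
```

Because every intermediate stage is an ordinary R object, a plot can be extended or varied later without repeating the shared part.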
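3. A First ggplot2 Plot
3.1 Installing and loading
# Install (if not already installed)
# install.packages("ggplot2")
# Or install the whole tidyverse
# install.packages("tidyverse")
library(ggplot2)
library(dplyr) # data manipulation
3.2 A scatter plot from a built-in dataset
We use the mpg dataset that ships with ggplot2 (fuel economy data for cars) as the example.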
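# Inspect the data
head(mpg)
# Basic scatter plot: engine displacement vs. highway fuel economy
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point()
How to read this:
- ggplot(): initializes the plot object
- aes(x = displ, y = hwy): maps displ to the x axis and hwy to the y axis
- geom_point(): draws the data as points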
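3.3 Adding a color mapping
When color is mapped to the class variable inside aes(), ggplot2 colors the points by vehicle class and adds a legend automatically.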
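A minimal sketch of the plot this subsection describes, assuming the same mpg variables used in section 3.2 (the object name p_color is illustrative):

```r
library(ggplot2)

# Map vehicle class to point color; ggplot2 picks the palette and draws the legend
p_color <- ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point()
p_color
```

No colors are named anywhere in the call: the default color scale assigns one color per distinct value of class.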
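3.4 Adding size and transparency
# Map point size to the number of cylinders and add transparency
ggplot(mpg, aes(x = displ, y = hwy, color = class, size = cyl)) +
  geom_point(alpha = 0.6)
Notes:
- alpha controls transparency (0 to 1); it can be mapped to a variable inside aes() or set to a fixed value outside it
- When size is mapped to a numeric variable, point size varies with the value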
4. Common Geometric Objects (Geoms)
4.1 Scatter plot (geom_point)
Suited to showing the relationship between two continuous variables.
# Bioinformatics example: gene expression scatter plot
set.seed(123)
gene_expr <- data.frame(
  gene = paste0("Gene", 1:100),
  sample_A = rnorm(100, mean = 5, sd = 2),
  sample_B = rnorm(100, mean = 5, sd = 2),
  significant = sample(c("Sig", "Not Sig"), 100, replace = TRUE, prob = c(0.3, 0.7))
)
ggplot(gene_expr, aes(x = sample_A, y = sample_B, color = significant)) +
  geom_point(size = 2, alpha = 0.7) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "gray50") +
  labs(title = "Gene expression correlation",
       x = "Sample A (log2 expression)",
       y = "Sample B (log2 expression)") +
  theme_bw()
4.2 Line plot (geom_line)
Suited to time series or other ordered data.
# Example: qPCR time series
time_series <- data.frame(
  time = rep(0:10, 3),
  expression = c(1 * 2^(0:10 * 0.3),  # Gene1
                 1 * 2^(0:10 * 0.5),  # Gene2
                 1 * 2^(0:10 * 0.2)), # Gene3
  gene = rep(c("Gene1", "Gene2", "Gene3"), each = 11)
)
ggplot(time_series, aes(x = time, y = expression, color = gene)) +
  geom_line(linewidth = 1) +
  geom_point(size = 2) +
  scale_y_log10() + # logarithmic y axis
  labs(title = "Gene expression time series",
       x = "Time (hours)",
       y = "Expression (log scale)") +
  theme_minimal()
4.3 Bar chart (geom_bar / geom_col)
- geom_bar(): counts rows automatically; use it with a categorical variable in raw data
- geom_col(): uses the y values as given; use it with data that is already summarized
# geom_bar: counts automatically
ggplot(mpg, aes(x = class)) +
  geom_bar(fill = "steelblue") +
  labs(title = "Vehicle class distribution", x = "Class", y = "Count") +
  theme_classic()
# geom_col: uses precomputed values
pathway_counts <- data.frame(
  pathway = c("MAPK", "PI3K-AKT", "WNT", "Notch", "TGF-beta"),
  gene_count = c(45, 38, 25, 18, 22)
)
ggplot(pathway_counts, aes(x = reorder(pathway, gene_count), y = gene_count)) +
  geom_col(fill = "coral") +
  coord_flip() + # horizontal bars
  labs(title = "Genes per signaling pathway", x = NULL, y = "Number of genes") +
  theme_minimal()
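The relationship between the two functions can be checked directly. The sketch below counts the rows per class by hand with dplyr::count() and confirms that geom_col() on those counts draws the same bar heights as geom_bar() on the raw data. The object names are illustrative.

```r
library(ggplot2)
library(dplyr)

# Count the rows per class by hand; count() stores the result in column n
class_counts <- count(mpg, class)

# geom_bar() counts the rows itself
p_bar <- ggplot(mpg, aes(x = class)) +
  geom_bar()

# geom_col() plots the precomputed counts as they are
p_col <- ggplot(class_counts, aes(x = class, y = n)) +
  geom_col()

# Bar heights computed for each plot
layer_data(p_bar)$y
layer_data(p_col)$y
```

In practice geom_col() is the natural choice when the summary table already exists, for example the output of an enrichment analysis.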
4.4 Box plot (geom_boxplot)
Shows the distribution of the data: median, quartiles, and outliers.
# Example: gene expression in different tissues
tissue_expr <- data.frame(
  tissue = rep(c("Liver", "Brain", "Muscle", "Heart"), each = 50),
  expression = c(rnorm(50, 8, 2), rnorm(50, 6, 1.5),
                 rnorm(50, 7, 2.5), rnorm(50, 9, 1.8))
)
ggplot(tissue_expr, aes(x = tissue, y = expression, fill = tissue)) +
  # hide the box plot's own outlier points; the jittered points already show them
  geom_boxplot(alpha = 0.7, outlier.shape = NA) +
  geom_jitter(width = 0.2, alpha = 0.3, size = 1) + # overlay the raw data points
  labs(title = "Gene expression by tissue",
       x = "Tissue", y = "Expression (log2)") +
  theme_bw() +
  theme(legend.position = "none")
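To show what the box, the median line, and the whiskers stand for, the sketch below recomputes them by hand for one simulated group. It then prints the values ggplot2 computes for the same data. It assumes the default settings of geom_boxplot() (whiskers reaching at most 1.5 times the interquartile range); the data and object names are made up for illustration.

```r
library(ggplot2)

set.seed(1)
x <- rnorm(50, mean = 8, sd = 2)  # one simulated group

q <- quantile(x, c(0.25, 0.5, 0.75))  # lower box edge, median, upper box edge
iqr <- q[3] - q[1]

# Points further than 1.5 * IQR from the box are drawn as outliers
is_outlier <- x < q[1] - 1.5 * iqr | x > q[3] + 1.5 * iqr

# Whiskers end at the most extreme values that are not outliers
whiskers <- range(x[!is_outlier])

# Values ggplot2 computes for the same data
p_box <- ggplot(data.frame(group = "A", value = x), aes(x = group, y = value)) +
  geom_boxplot()
layer_data(p_box)[, c("ymin", "lower", "middle", "upper", "ymax")]
```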
4.5 Violin plot (geom_violin)
Combines the strengths of the box plot and the density plot.
ggplot(tissue_expr, aes(x = tissue, y = expression, fill = tissue)) +
  geom_violin(alpha = 0.7) +
  geom_boxplot(width = 0.1, fill = "white", outlier.shape = NA) +
  labs(title = "Gene expression distribution (violin plot)",
       x = "Tissue", y = "Expression (log2)") +
  theme_minimal() +
  theme(legend.position = "none")
4.6 Histogram (geom_histogram)
Shows the distribution of a single continuous variable.
ggplot(gene_expr, aes(x = sample_A)) +
  geom_histogram(bins = 20, fill = "skyblue", color = "black", alpha = 0.7) +
  geom_vline(xintercept = mean(gene_expr$sample_A),
             linetype = "dashed", color = "red", linewidth = 1) +
  labs(title = "Gene expression distribution",
       x = "Expression (log2)", y = "Frequency") +
  theme_classic()
4.7 Density plot (geom_density)
A smoothed curve of the distribution.
ggplot(tissue_expr, aes(x = expression, fill = tissue)) +
  geom_density(alpha = 0.5) +
  labs(title = "Gene expression density by tissue",
       x = "Expression (log2)", y = "Density") +
  theme_minimal()
5. Aesthetic Mappings (Aesthetics)
5.1 Global vs. local mappings
# Global mapping: shared by all layers
ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) # inherits the color mapping: one line per class
# Local mapping: applies to one layer only
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(color = class)) + # only the points are colored by class
  geom_smooth(method = "lm", se = FALSE, color = "black") # a single black trend line
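The practical effect of the two styles is the number of trend lines that get fitted. The sketch below builds both versions and counts the groups in the smoothing layer with layer_data(); formula = y ~ x is spelled out only to silence the informational message, and the object names are illustrative.

```r
library(ggplot2)

# Global color mapping: geom_smooth() inherits it and fits one line per class
p_global <- ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# Local color mapping: only the points are grouped, so a single line is fitted
p_local <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(color = class)) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, color = "black")

# Number of fitted lines in the smoothing layer (layer 2) of each plot
n_lines_global <- length(unique(layer_data(p_global, 2)$group))
n_lines_local  <- length(unique(layer_data(p_local, 2)$group))
```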
5.2 Fixed vs. mapped properties
# Incorrect: a fixed value placed inside aes()
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(color = "blue")) # wrong: "blue" is treated as a data value, not as a color
# Correct: fixed values go outside aes()
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(color = "blue", size = 3)
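The difference is easy to verify by inspecting the colors that are actually used for drawing. The sketch below is a small check with layer_data(); the object names are illustrative.

```r
library(ggplot2)

# Constant inside aes(): treated as a one-level data variable
p_mapped <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(color = "blue"))

# Constant outside aes(): used as the literal color
p_fixed <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(color = "blue")

# Colors used to draw the points in each plot
unique(layer_data(p_mapped)$colour)  # a color chosen by the default color scale
unique(layer_data(p_fixed)$colour)   # the literal value "blue"
```

The mapped version also produces a legend with a single entry labeled "blue", which is usually the first visible sign of this mistake.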
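5.3 Common aesthetics
| Aesthetic | Description | Applicable geoms |
|---|---|---|
| x, y | Position on the axes | All |
| color | Color of points and lines | point, line, text |
| fill | Fill color | bar, boxplot, violin, area |
| size | Size of points and text (line width is set with linewidth since ggplot2 3.4.0) | point, text |
| alpha | Transparency (0 to 1) | All |
| shape | Point shape | point |
| linetype | Line type | line, smooth |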
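6. Facets
Faceting splits the data by a categorical variable and draws one subplot per group.
6.1 facet_wrap()
Facets by a single variable; the subplots are arranged automatically.
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_wrap(~ class, nrow = 2) +
  labs(title = "Faceted by vehicle class") +
  theme_bw()
6.2 facet_grid()
Facets by the combination of two variables, forming a grid.
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_grid(drv ~ cyl) +
  labs(title = "Faceted by drive type and number of cylinders") +
  theme_minimal()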
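By default all panels share the same axes. When groups differ a lot in range, the scales argument lets each panel use its own axis. The sketch below contrasts the default with scales = "free_y" and reads the panel count from the built plot; the object names are illustrative.

```r
library(ggplot2)

# Fixed scales (default): every panel shares the same axes
p_fixed_scales <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_wrap(~ class, nrow = 2)

# Free y scales: each panel gets its own y-axis range
p_free_y <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_wrap(~ class, nrow = 2, scales = "free_y")

# One panel is created per level of the faceting variable
n_panels <- nrow(ggplot_build(p_free_y)$layout$layout)
```

Shared axes make panels directly comparable, so free scales are best reserved for cases where the within-panel pattern matters more than the comparison across panels.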
7. Scales
Scales control how data values are mapped to visual properties.
7.1 Axis scales
# Logarithmic scale
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  scale_x_log10() +
  scale_y_continuous(breaks = seq(10, 50, by = 5)) +
  labs(title = "Logarithmic x axis") +
  theme_bw()
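Axis ranges can be restricted in two ways that behave differently, which matters as soon as a layer computes statistics. The sketch below compares limits set on the scale with limits set on the coordinate system, using an arbitrary range of 20 to 30 chosen only for illustration.

```r
library(ggplot2)

base <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()

# Scale limits: observations outside 20-30 are turned into NA and not drawn
# (ggplot2 warns about the removed rows when the plot is rendered)
p_limits <- base + scale_y_continuous(limits = c(20, 30))

# Coordinate limits: all observations are kept; only the visible window is cropped
p_zoom <- base + coord_cartesian(ylim = c(20, 30))

# Observations still available to the point layer in each case
n_kept_limits <- sum(!is.na(layer_data(p_limits)$y))
n_kept_zoom   <- sum(!is.na(layer_data(p_zoom)$y))
```

For a plain scatter plot both look the same, but for box plots or smoothers the scale version recomputes the statistics on the reduced data, while coord_cartesian() only zooms.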
7.2 Color scales
# Specify colors manually
ggplot(tissue_expr, aes(x = tissue, y = expression, fill = tissue)) +
  geom_boxplot() +
  scale_fill_manual(values = c("Liver" = "#E64B35",
                               "Brain" = "#4DBBD5",
                               "Muscle" = "#00A087",
                               "Heart" = "#F39B7F")) +
  theme_minimal()
# Use a ColorBrewer palette
ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point(size = 3) +
  scale_color_brewer(palette = "Set1") +
  theme_classic()
7.3 Gradient color scales
# Color gradient for a continuous variable
gene_heatmap_data <- expand.grid(
  gene = paste0("Gene", 1:20),
  sample = paste0("Sample", 1:10)
)
gene_heatmap_data$expression <- rnorm(200, mean = 5, sd = 2)
ggplot(gene_heatmap_data, aes(x = sample, y = gene, fill = expression)) +
  geom_tile() +
  scale_fill_gradient2(low = "blue", mid = "white", high = "red",
                       midpoint = 5) +
  labs(title = "Gene expression heatmap") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
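Expression data usually arrives as a genes-by-samples matrix rather than as a long table. The sketch below shows one base R way to reshape such a matrix into the one-row-per-cell layout that geom_tile() expects. The matrix expr_mat is simulated and hypothetical.

```r
library(ggplot2)

set.seed(42)
# A hypothetical expression matrix: 20 genes (rows) x 10 samples (columns)
expr_mat <- matrix(rnorm(200, mean = 5, sd = 2), nrow = 20,
                   dimnames = list(gene = paste0("Gene", 1:20),
                                   sample = paste0("Sample", 1:10)))

# as.table() followed by as.data.frame() yields one row per matrix cell
expr_long <- as.data.frame(as.table(expr_mat), responseName = "expression")

ggplot(expr_long, aes(x = sample, y = gene, fill = expression)) +
  geom_tile() +
  scale_fill_gradient2(low = "blue", mid = "white", high = "red",
                       midpoint = mean(expr_mat)) +
  theme_minimal()
```

Setting the midpoint to the mean of the matrix keeps white at the center of the data; with row-scaled values (z-scores), a midpoint of 0 would be the usual choice.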
8. Themes
8.1 Built-in themes
p <- ggplot(mpg, aes(x = class, fill = class)) +
  geom_bar() +
  labs(title = "Built-in theme examples")
# theme_gray() (the default)
p + theme_gray()
# theme_bw() (black and white; a common choice for papers)
p + theme_bw()
# theme_minimal() (minimal)
p + theme_minimal()
# theme_classic() (classic look, no grid lines)
p + theme_classic()
8.2 Customizing theme elements
ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point(size = 3) +
  labs(title = "Custom theme example",
       x = "Engine displacement (L)",
       y = "Highway fuel economy (mpg)") +
  theme_bw() +
  theme(
    plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
    axis.title = element_text(size = 12, face = "bold"),
    axis.text = element_text(size = 10),
    legend.position = "bottom",
    legend.title = element_text(size = 11, face = "bold"),
    panel.grid.minor = element_blank()
  )
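When the same adjustments are needed for many figures, they can be wrapped in a function and reused. The sketch below shows one way to do this; theme_pub() is a hypothetical helper name and not part of ggplot2.

```r
library(ggplot2)

# theme_pub() is a hypothetical helper, not a ggplot2 function
theme_pub <- function(base_size = 12) {
  theme_bw(base_size = base_size) +
    theme(
      plot.title = element_text(face = "bold", hjust = 0.5),
      axis.title = element_text(face = "bold"),
      legend.position = "bottom",
      panel.grid.minor = element_blank()
    )
}

p_themed <- ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point() +
  labs(title = "Reusable theme example") +
  theme_pub()
p_themed
```

Keeping the theme in one function means a later change, for example a different base font size for a poster, only has to be made once.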
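9. Bioinformatics Case Studies
9.1 Volcano plot
# Simulate differential expression data
set.seed(456)
n_genes <- 5000
log2FC <- rnorm(n_genes, mean = 0, sd = 1.5)
volcano_data <- data.frame(
  gene = paste0("Gene", 1:n_genes),
  log2FC = log2FC,
  # Purely uniform p-values would leave almost nothing significant after BH
  # correction, so (as an illustrative assumption) p-values shrink as |log2FC| grows
  pvalue = runif(n_genes, 0, 1)^(1 + 2 * abs(log2FC))
)
volcano_data$padj <- p.adjust(volcano_data$pvalue, method = "BH")
volcano_data$neg_log10_padj <- -log10(volcano_data$padj)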
# Classify genes: significantly up-regulated, down-regulated, or not significant
volcano_data <- volcano_data %>%
  mutate(
    diff_expressed = case_when(
      log2FC > 1 & padj < 0.05 ~ "Up-regulated",
      log2FC < -1 & padj < 0.05 ~ "Down-regulated",
      TRUE ~ "Not significant"
    )
  )
# Draw the volcano plot
ggplot(volcano_data, aes(x = log2FC, y = neg_log10_padj, color = diff_expressed)) +
  geom_point(alpha = 0.5, size = 1.5) +
  scale_color_manual(values = c("Up-regulated" = "red",
                                "Down-regulated" = "blue",
                                "Not significant" = "gray")) +
  geom_vline(xintercept = c(-1, 1), linetype = "dashed", color = "gray40") +
  geom_hline(yintercept = -log10(0.05), linetype = "dashed", color = "gray40") +
  labs(title = "Volcano plot: differentially expressed genes",
       x = "Log2 Fold Change",
       y = "-Log10(Adjusted P-value)",
       color = "Status") +
  theme_bw() +
  theme(legend.position = "top")
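Volcano plots are often annotated with the names of the strongest hits. The sketch below is one way to do this with ggplot2 alone: pass a subset of the data to an extra geom_text() layer. It uses its own small simulated table res, whose padj column is a stand-in for adjusted p-values; all values and names are illustrative.

```r
library(ggplot2)

set.seed(1)
# Small simulated result table (illustrative values only)
res <- data.frame(
  gene = paste0("Gene", 1:200),
  log2FC = rnorm(200, mean = 0, sd = 1.5)
)
# Stand-in for adjusted p-values: smaller for larger |log2FC|
res$padj <- runif(200)^(1 + 2 * abs(res$log2FC))

# Keep the five genes with the smallest adjusted p-values for labeling
top_hits <- head(res[order(res$padj), ], 5)

ggplot(res, aes(x = log2FC, y = -log10(padj))) +
  geom_point(alpha = 0.5, color = "gray50") +
  geom_point(data = top_hits, color = "red") +
  geom_text(data = top_hits, aes(label = gene),
            vjust = -0.8, size = 3, check_overlap = TRUE) +
  theme_bw()
```

Each layer can take its own data argument, so only the selected genes are labeled. When many labels crowd together, the ggrepel package offers a drop-in alternative that moves labels apart.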
9.2 MA plot
# Simulated data (reuses n_genes from section 9.1)
ma_data <- data.frame(
  gene = paste0("Gene", 1:n_genes),
  baseMean = 10^runif(n_genes, 0, 4),
  log2FC = rnorm(n_genes, 0, 1.2)
)
ma_data$significant <- abs(ma_data$log2FC) > 1
ggplot(ma_data, aes(x = log10(baseMean), y = log2FC, color = significant)) +
  geom_point(alpha = 0.4, size = 1) +
  scale_color_manual(values = c("TRUE" = "red", "FALSE" = "gray50")) +
  geom_hline(yintercept = c(-1, 1), linetype = "dashed", color = "blue") +
  geom_hline(yintercept = 0, linetype = "solid", color = "black") +
  labs(title = "MA Plot",
       x = "Log10 Mean Expression",
       y = "Log2 Fold Change") +
  theme_minimal() +
  theme(legend.position = "none")
9.3 PCA plot (principal component analysis)
# Simulated PCA result
set.seed(789)
pca_data <- data.frame(
  sample = paste0("S", 1:30),
  PC1 = c(rnorm(15, -2, 1), rnorm(15, 2, 1)),
  PC2 = c(rnorm(15, 1, 1.5), rnorm(15, -1, 1.5)),
  group = rep(c("Control", "Treatment"), each = 15)
)
ggplot(pca_data, aes(x = PC1, y = PC2, color = group, shape = group)) +
  geom_point(size = 4, alpha = 0.8) +
  stat_ellipse(level = 0.95, linewidth = 1) + # add a 95% ellipse per group
  labs(title = "PCA Plot",
       x = "PC1 (45.3% variance)",
       y = "PC2 (23.7% variance)") +
  theme_bw() +
  theme(legend.position = "top",
        legend.title = element_blank())
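With real data the scores and the variance percentages in the axis labels come from the PCA itself instead of being typed in. The sketch below shows one way to derive them with base R's prcomp(), using the built-in iris measurements as stand-in data for an expression matrix.

```r
library(ggplot2)

# PCA on the four numeric columns of the built-in iris data (stand-in data)
pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)

# Share of the total variance captured by each principal component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

# Scores of every observation on the first two components, plus the grouping
scores <- data.frame(pca$x[, 1:2], group = iris$Species)

ggplot(scores, aes(x = PC1, y = PC2, color = group, shape = group)) +
  geom_point(size = 3, alpha = 0.8) +
  stat_ellipse(level = 0.95) +
  labs(x = sprintf("PC1 (%.1f%% variance)", 100 * var_explained[1]),
       y = sprintf("PC2 (%.1f%% variance)", 100 * var_explained[2])) +
  theme_bw()
```

For an expression matrix with genes in rows and samples in columns, the matrix would typically be transposed first so that each sample becomes one observation.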
10. Combining Plots
10.1 Using the patchwork package
# install.packages("patchwork")
library(patchwork)
p1 <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  labs(title = "Plot 1") +
  theme_bw()
p2 <- ggplot(mpg, aes(x = class, fill = class)) +
  geom_bar() +
  labs(title = "Plot 2") +
  theme_bw() +
  theme(legend.position = "none")
p3 <- ggplot(mpg, aes(x = hwy)) +
  geom_histogram(bins = 20, fill = "steelblue") +
  labs(title = "Plot 3") +
  theme_bw()
# Layout: p1 and p2 side by side, p3 underneath
(p1 | p2) / p3 +
  plot_annotation(title = "Combined plot example",
                  tag_levels = "A")
11. Saving Plots
# Save the most recently displayed plot
ggsave("my_plot.png", width = 8, height = 6, dpi = 300)
ggsave("my_plot.pdf", width = 8, height = 6)
# Save a specific plot object
p <- ggplot(mpg, aes(x = displ, y = hwy)) + geom_point()
ggsave("scatter.png", plot = p, width = 10, height = 8, dpi = 300)
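Journals often state figure sizes in centimeters or millimeters, while ggsave() interprets width and height as inches unless told otherwise. The sketch below saves a plot with explicit units and checks that the file was written. It writes to a temporary file, and the size of 16 x 12 cm is an arbitrary example.

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = displ, y = hwy)) + geom_point()

# Write to a temporary file so the example leaves nothing behind in the project
out_file <- tempfile(fileext = ".png")

# width and height default to inches; units switches to "cm", "mm", or "px"
ggsave(out_file, plot = p, width = 16, height = 12, units = "cm", dpi = 300)

file.exists(out_file)
```

The output format is inferred from the file extension, so the same call with a .pdf name produces a vector graphic.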
12. Best Practices and Tips
12.1 Preparing the data
# Prepare the data with the tidyverse
library(tidyr)
# Long format vs. wide format
wide_data <- data.frame(
  sample = c("S1", "S2", "S3"),
  GeneA = c(5.2, 6.1, 5.8),
  GeneB = c(7.3, 6.9, 7.5)
)
# Convert to long format for ggplot2
long_data <- wide_data %>%
  pivot_longer(cols = c(GeneA, GeneB),
               names_to = "gene",
               values_to = "expression")
ggplot(long_data, aes(x = sample, y = expression, fill = gene)) +
  geom_col(position = "dodge") +
  theme_minimal()
12.2 Organizing the code
# Put each ggplot2 layer on its own line; this is easier to read and debug
ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point(size = 3, alpha = 0.6) +
  scale_color_brewer(palette = "Set2") +
  labs(
    title = "Engine Displacement vs Highway MPG",
    subtitle = "Colored by vehicle class",
    x = "Engine Displacement (L)",
    y = "Highway Miles per Gallon",
    color = "Vehicle Class"
  ) +
  theme_bw() +
  theme(
    plot.title = element_text(size = 14, face = "bold"),
    legend.position = "bottom"
  )
12.3 Common problems
Problem 1: Chinese characters in labels do not display
# macOS
theme(text = element_text(family = "STHeiti"))
# Windows
theme(text = element_text(family = "SimHei"))
# Linux: use a CJK font that is installed on the system, for example "Noto Sans CJK SC"
Problem 2: overlapping axis labels
ggplot(mpg, aes(x = manufacturer, y = hwy)) +
  geom_boxplot() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
Problem 3: adjusting the legend position
ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point() +
  theme(legend.position = "bottom") # other options: "top", "left", "right", "none"
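13. Recommended Extension Packages
- ggpubr: publication-ready plots (with statistical tests)
- ggsci: color palettes from scientific journals
- ggrepel: automatically keeps text labels from overlapping
- patchwork: combines multiple plots
- plotly: interactive plots (ggplotly())
- gganimate: animated plots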
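14. Summary
This section introduced the core concepts and usage of ggplot2:
- Grammar of graphics: data + aesthetic mappings + geometric objects + ...
- Common geoms: points, lines, bars, box plots, violin plots, heatmaps
- Aesthetic mappings: global vs. local, fixed vs. mapped
- Facets: facet_wrap() and facet_grid()
- Scales: axes, colors, gradients
- Themes: built-in themes and customization
- Bioinformatics examples: volcano plot, MA plot, PCA plot
Suggested next steps:
- Read the official ggplot2 documentation
- Browse the R Graph Gallery for inspiration
- Practice plotting with real datasets
- Explore the ecosystem of ggplot2 extension packages
Further resources:
- ggplot2: Elegant Graphics for Data Analysis
- R for Data Science: Data Visualization
- ggplot2 Cheatsheet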