Intro to R Programming - Lesson 4 (Graphs)

Introduction

Welcome back to the series. Lesson 3 was all about cleaning and reshaping data, which is like tidying your desk before getting real work done. Lesson 4 is the fun part: visualizing your data so that patterns jump out and stories become obvious. In R, the most widely used way to create modern, publication-quality charts is the ggplot2 package. If you've seen crisp charts on blogs, in academic papers, or in tidyverse tutorials, you were probably looking at ggplot2. This lesson gives you a practical, first-principles introduction to ggplot2 that you can apply immediately to your own projects. The tone stays casual and blog-like, aiming for depth without stiffness, and bullet lists are kept to a minimum.

If you're coming from Excel, think of ggplot2 as moving from manual formatting toward code-driven reproducibility. If you're coming from Python's Matplotlib or Seaborn, you'll notice that the "grammar of graphics" style in ggplot2 feels declarative: you describe what you want to see and let the system compose the result. Our plan is straightforward. We'll set up a first plot, talk through the grammar, experiment with geoms, mapping and grouping, scales and transformations, facets for small multiples, and the finishing touches like labels, themes, and exporting. Along the way, I'll try to connect each idea to the questions you actually ask of your data, so this is not just about syntax but also about why certain visualization choices help you reason better.

We'll use the built-in mtcars dataset because it's convenient and familiar, but the same patterns apply to your own data frames. Feel free to replace mtcars with your dataset as you follow along. As always, keep your code in an R Markdown document so that your analysis is repeatable, versionable, and shareable.

library(ggplot2)
library(dplyr)
theme_set(theme_minimal())

The grammar of graphics in 60 seconds

A ggplot has a few core components that you'll use again and again. There is the data frame that holds your observations and variables. There is an aesthetic mapping that says which variables map to visual channels like x-position, y-position, color, size, fill, and shape. There are geoms, short for geometric objects, that define how to draw your data as points, lines, bars, densities, or boxes. There are scales that translate data values into positions and colors on the page. There are guides like axes and legends that help readers decode the picture. Finally, there are themes that govern typography, grid lines, margins, background, and other polish. As you'll see, most charts are variations on this template, which is why ggplot2 scales so well from simple to sophisticated plots.
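
To make that list concrete, here is a minimal sketch that touches each component once. The particular choices (weight on the x-axis, the tick positions, the legend title) are only illustrative assumptions, not a recommended recipe.

```r
library(ggplot2)

# Each line below corresponds to one component of the grammar.
p <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +  # data + aesthetic mapping
  geom_point(size = 2) +                                # geom: draw observations as points
  scale_x_continuous(breaks = 2:5) +                    # scale: choose where the x ticks go
  guides(color = guide_legend(title = "Cylinders")) +   # guide: legend that decodes color
  theme_minimal()                                       # theme: non-data styling

p
```

Removing any line after the first still gives a valid plot, because ggplot2 fills in a sensible default for every component you leave out.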

Your first ggplot

Start with the simplest useful scatter plot: miles per gallon as a function of horsepower. If you're exploring fuel efficiency, this is a natural first view, and you can eyeball whether more power tends to cost you mileage.

ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point()

The pattern is already visible: higher horsepower tends to come with lower miles per gallon. You didn't manually place points; you described the mapping and ggplot2 rendered the layer. That is the core rhythm: declare, then layer.

Mapping vs. setting

Aesthetic values can be mapped to data or set to constants. Color is a good demonstration. If you write aes(color = ...), you're mapping a variable to color, which creates a legend and multiple colors. If you write color = "steelblue" outside aes(), you're telling ggplot to draw everything in that single color. This distinction trips people up initially, so it's worth practicing explicitly.

# Mapping color to a variable produces a legend and multiple colors
ggplot(mtcars, aes(hp, mpg, color = factor(cyl))) +
  geom_point(size = 3)

# Setting a constant color uses one color for all points, no legend
ggplot(mtcars, aes(hp, mpg)) +
  geom_point(color = "steelblue", size = 3)

When you want to emphasize groups, map color or shape to a categorical variable. When you want a unified look, set a constant. For numeric variables, you can map to color or size to emphasize magnitude, but always check that the encoding helps, rather than distracts.

Geoms you'll use on day one

Scatter plots are a staple, but the grammar lets you swap in other geoms without rewriting everything from scratch. Keeping the same data and aesthetics, change the geometry and you flip the visual metaphor. Lines imply order, bars imply counts or aggregates, boxes summarize distributions. Try a handful to feel the differences.

# Lines: best when x has an inherent order (time, distance, rank)
ggplot(mtcars, aes(wt, mpg)) +
  geom_line()

# Bars: by default ggplot2 counts rows per x category (stat = "count")
ggplot(mtcars, aes(x = factor(cyl))) +
  geom_bar()

# Histograms: distribution of a numeric variable via bins
ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(bins = 10)

# Density: a smoothed alternative to a histogram
ggplot(mtcars, aes(x = mpg)) +
  geom_density()

# Boxplots: compare distributions across categories
ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_boxplot()

# Violin: combine density shape with group comparisons
ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_violin(trim = FALSE)

While you're experimenting, ask yourself what question the plot answers. If you want to compare centers and variation across groups, boxplots or violins are ideal. If you want to see relationship and spread between two numeric variables, scatter is the default, and adding a fitted line can be illuminating, which we'll do shortly.

Adding statistical layers

The grammar makes it easy to overlay simple statistical summaries. A smoothing line often helps you see the overall trend without imposing a rigid parametric form. In ggplot2, geom_smooth() gives you several options. By default it fits a loess curve when there are fewer than 1,000 observations and a GAM otherwise; if you want a straight line, set method = "lm" for a linear model.

ggplot(mtcars, aes(hp, mpg)) +
  geom_point(alpha = 0.8) +
  geom_smooth(se = FALSE)

ggplot(mtcars, aes(hp, mpg, color = factor(cyl))) +
  geom_point() +
  geom_smooth(se = FALSE, method = "lm")

There is no magic in the smooth; it's a layer. That means you can control its aesthetics, include or exclude it from legends, and compute separate trends per group by mapping color or linetype.
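
As a small sketch of that flexibility, the example below maps linetype inside the smooth layer only, so one line is fitted per cylinder group while the legend stays limited to the points. The grey line color and the choice to hide the smooth from the legend are just illustrative assumptions.

```r
library(ggplot2)

p_trend <- ggplot(mtcars, aes(hp, mpg)) +
  geom_point(aes(color = factor(cyl))) +
  # linetype is mapped only in this layer, so one line is fitted per cylinder group
  geom_smooth(
    aes(linetype = factor(cyl)),
    method = "lm", formula = y ~ x, se = FALSE,
    color = "grey30",      # a constant: all trend lines share one color
    show.legend = FALSE    # keep the trend lines out of the legend
  ) +
  labs(color = "Cylinders")

p_trend
```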

Scales and transformations

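Scales handle the translation from data space to the marks on the plot. When values are skewed or span orders of magnitude, a log transform can reveal structure that a linear axis hides. You can log-transform the data before plotting or tell the scale to transform, which leaves your data frame untouched. One caveat: a scale transformation is applied before statistical layers are computed, so something like geom_smooth() is fitted to the transformed values. If you want statistics computed on the original values and only the display warped, use coord_trans() instead.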

{r} 复制代码

# Transform the x scale to log10 without altering raw data
ggplot(mtcars, aes(hp, mpg)) +
  geom_point() +
  scale_x_log10()
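
It can help to see where the transformation happens relative to a fitted line. The sketch below compares a scale transformation with a coordinate transformation on the same base plot; the object names are arbitrary, and if your ggplot2 version warns that coord_trans() is deprecated, follow the replacement that the warning names.

```r
library(ggplot2)

base <- ggplot(mtcars, aes(hp, mpg)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# Scale transformation: hp is log10-transformed BEFORE the model is fitted,
# so the fit is mpg ~ log10(hp) and the line is straight on the log axis.
p_scale <- base + scale_x_log10()

# Coordinate transformation: the model is fitted on raw hp (mpg ~ hp) and only
# the drawing is warped, so the same fit appears curved on the log axis.
p_coord <- base + coord_trans(x = "log10")

p_scale
p_coord
```

Neither version modifies mtcars itself; the difference is only in what the statistical layer sees.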

Scales also manage breaks, labels, limits, and color palettes. If a default axis picks awkward ticks, specify your own breaks and labels to improve readability. For discrete scales like color for categories, consider palettes that are colorblind-friendly and print-friendly.

ggplot(mtcars, aes(hp, mpg, color = factor(cyl))) +
  geom_point(size = 3) +
  scale_x_continuous(breaks = seq(50, 350, 50)) +
  scale_y_continuous(labels = function(x) paste0(x, " mpg"))
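
For the palette advice, one low-effort option is a ColorBrewer palette, which ggplot2 exposes through scale_color_brewer(). The sketch below assumes the "Dark2" palette, which is generally regarded as colorblind-safe; treat the specific palette as a suggestion to test with your own audience rather than a rule.

```r
library(ggplot2)

p_palette <- ggplot(mtcars, aes(hp, mpg, color = factor(cyl))) +
  geom_point(size = 3) +
  # Swap the default hues for a qualitative ColorBrewer palette
  # and set the legend title in the same call
  scale_color_brewer(palette = "Dark2", name = "Cylinders")

p_palette
```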

Think of scales as the place you negotiate with your reader: how dense should ticks be, which number format is friendliest, do you want to cap the range, and which legend title makes sense.

Faceting for small multiples

Facets slice your data into panels, one per category, which is often better than throwing everything into a single busy plot. When groups overlap or the relationship seems to vary by subgroup, faceting lets you compare apples to apples.

ggplot(mtcars, aes(hp, mpg)) +
  geom_point() +
  facet_wrap(~ cyl)

If you prefer a grid defined by two variables, use facet_grid(rows ~ cols). For example, row panels per number of gears and column panels per number of cylinders give a compact visual of how relationships vary in both dimensions.

ggplot(mtcars, aes(hp, mpg)) +
  geom_point() +
  facet_grid(gear ~ cyl)

Facets make comparisons fair because axes are aligned across panels by default. If your categories have wildly different ranges, you can let scales vary with scales = "free", but do so sparingly because you lose the cross-panel alignment that makes small multiples powerful.
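
Here is a quick sketch of that trade-off. It frees only the x-axis with scales = "free_x", which is a middle ground I chose for illustration: each panel zooms to its own horsepower range while mpg stays directly comparable across panels.

```r
library(ggplot2)

# Default: shared axes, so positions mean the same thing in every panel
p_fixed <- ggplot(mtcars, aes(hp, mpg)) +
  geom_point() +
  facet_wrap(~ cyl)

# Free x only: each panel gets its own hp range, the y-axis stays shared
p_free <- ggplot(mtcars, aes(hp, mpg)) +
  geom_point() +
  facet_wrap(~ cyl, scales = "free_x")

p_fixed
p_free
```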

Labels, titles, and captions

A good plot is readable without you standing next to it. That means clear titles, axis labels that use human language instead of cryptic codes, and a caption that cites the data source or explains filters. The labs() function centralizes this polish so you're not spelunking through theme settings for simple text.

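# color is mapped here so that the legend title set in labs() has something to label
ggplot(mtcars, aes(hp, mpg, color = factor(cyl))) +
  geom_point() +
  labs(
    title = "Fuel efficiency decreases as horsepower increases",
    subtitle = "mtcars dataset: 32 models from 1973–74",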
    x = "Horsepower",
    y = "Miles per gallon",
    color = "Cylinders",
    caption = "Source: base R 'mtcars'"
  )

Use subtitles to add context and captions to cite sources or describe caveats. Avoid shouting in all caps and keep titles informative rather than clever. Readers appreciate clarity.

Themes: the last 10% that matters

Themes set the visual style. For reports, theme_minimal() or theme_classic() often hits the right tone; for slides, theme_light() can be easier to read on projectors. You can also modify theme elements directly, which is less scary than it first appears.

p <- ggplot(mtcars, aes(hp, mpg, color = factor(cyl))) +
  geom_point(size = 3) +
  labs(title = "Horsepower vs. MPG by cylinder count", x = "HP", y = "MPG", color = "Cyl")

p + theme_minimal(base_size = 12)

p + theme_classic(base_size = 12)

p + theme(
  plot.title = element_text(face = "bold"),
  panel.grid.minor = element_blank(),
  legend.position = "bottom",
  axis.title.x = element_text(margin = margin(t = 8))
)

A consistent theme across all figures in a report pays dividends in perceived quality. If you find yourself reusing the same tweaks, wrap them into a helper like theme_my_report() and call it once per plot.
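
A helper like that can be very small. The sketch below is one hypothetical version of theme_my_report(); the base theme and the individual tweaks are placeholders you would replace with your own house style.

```r
library(ggplot2)

# Hypothetical helper: the name and the specific tweaks are placeholders
theme_my_report <- function(base_size = 12) {
  theme_minimal(base_size = base_size) +
    theme(
      plot.title = element_text(face = "bold"),
      panel.grid.minor = element_blank(),
      legend.position = "bottom"
    )
}

p_report <- ggplot(mtcars, aes(hp, mpg, color = factor(cyl))) +
  geom_point(size = 3) +
  labs(title = "Horsepower vs. MPG by cylinder count", color = "Cyl") +
  theme_my_report()

p_report
```

Because the helper returns an ordinary theme object, you can still add one-off theme() adjustments after it on any single plot.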

Annotations that tell a story

Beyond titles and labels, annotations pull the reader's eye to the key insight. If a particular point is a surprising outlier or if a threshold matters, annotate it. The annotate() function lets you add text and arrows; horizontal and vertical reference lines are available with geom_hline() and geom_vline().

mpg_vs_hp <- ggplot(mtcars, aes(hp, mpg)) +
  geom_point() +
  geom_smooth(se = FALSE, color = "grey50") +
  geom_vline(xintercept = 200, linetype = "dashed") +
  annotate("text", x = 220, y = 30, label = "200 HP threshold", hjust = 0)

mpg_vs_hp

Treat annotation as part of the narrative. Don't overload the plot with labels for every point; instead, choose the few that advance the argument you're making.
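
To point at a single observation, you can combine a "segment" annotation carrying an arrow with a "text" annotation. The sketch below highlights the Maserati Bora, the highest-horsepower car in mtcars; the offsets used to place the label and arrow are eyeballed assumptions that you would tune for your own plot.

```r
library(ggplot2)

# Look up the point to highlight instead of hard-coding its coordinates
bora <- mtcars["Maserati Bora", ]

p_outlier <- ggplot(mtcars, aes(hp, mpg)) +
  geom_point() +
  annotate(
    "segment",
    x = bora$hp - 60, y = bora$mpg + 8,          # arrow tail, next to the label
    xend = bora$hp - 3, yend = bora$mpg + 0.5,   # arrow head, just short of the point
    arrow = arrow(length = unit(0.2, "cm"))
  ) +
  annotate(
    "text",
    x = bora$hp - 62, y = bora$mpg + 8,
    label = "Maserati Bora: highest hp in the data",
    hjust = 1   # right-align so the text ends where the arrow begins
  )

p_outlier
```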

Working with discrete vs. continuous scales

Color and fill behave differently for discrete and continuous variables. For discrete variables, ggplot chooses a qualitative palette. For continuous variables, you get a gradient. If your gradient implies order that doesn't exist or if a rainbow palette makes interpretation hard, pick something calmer and perceptually uniform. Even without extra packages, ggplot's default sequential gradients are decent. For discrete scales, use informative legend titles and rename factor levels so that they read like labels rather than codes.

ggplot(mtcars, aes(hp, mpg, color = wt)) +
  geom_point(size = 3) +
  labs(color = "Weight (1000 lbs)")

The gradient encoding of weight communicates magnitude smoothly; if you wanted bins, you could cut the variable into categories first with cut() and then map to a discrete scale.
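
A minimal sketch of the binning route follows. The break points and the labels "Light", "Medium", and "Heavy" are arbitrary assumptions chosen to cover the range of wt in mtcars; pick cut points that mean something in your own domain.

```r
library(ggplot2)
library(dplyr)

mt_binned <- mtcars %>%
  mutate(
    # Intervals are (1, 2.5], (2.5, 3.5], (3.5, 5.5]; wt is in 1000 lbs
    wt_class = cut(
      wt,
      breaks = c(1, 2.5, 3.5, 5.5),
      labels = c("Light", "Medium", "Heavy")
    )
  )

p_binned <- ggplot(mt_binned, aes(hp, mpg, color = wt_class)) +
  geom_point(size = 3) +
  labs(color = "Weight class")

p_binned
```

Since cut() returns a factor, ggplot2 switches to a discrete color scale on its own.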

Putting it together: a mini workflow

Let's script a tiny, realistic workflow to go from raw data to a publishable figure. Suppose your question is: how does fuel efficiency vary with horsepower for cars with automatic versus manual transmissions, and does cylinder count change the picture? The steps might be: recode transmission, fit simple lines per group, facet by cylinder count, and apply a consistent theme and labels.

mt <- mtcars %>%
  mutate(
    transmission = ifelse(am == 0, "Automatic", "Manual"),
    cyl = factor(cyl)
  )

ggplot(mt, aes(hp, mpg, color = transmission)) +
  geom_point(alpha = 0.8) +
  geom_smooth(method = "lm", se = FALSE) +
  facet_wrap(~ cyl) +
  labs(
    title = "MPG declines with horsepower across transmissions and cylinder counts",
    subtitle = "Linear trends are broadly similar; absolute levels vary by cylinder count",
    x = "Horsepower",
    y = "Miles per gallon",
    color = "Transmission",
    caption = "Source: mtcars"
  ) +
  theme_minimal(base_size = 12) +
  theme(legend.position = "bottom")

This chart is already presentation-ready. If you need to match a corporate style, swap in your font, colors, and logo, but the data story is encoded compactly and reproducibly.

Saving your plots like a pro

You can keep plots as objects and then write them to disk. Storing plot objects helps you reuse them, add layers later, or export to multiple formats and sizes without recomputing. The ggsave() function writes the last plot by default, or you can pass a plot object explicitly. By specifying width, height, and units, you make your output deterministic and friendly to downstream documents.

p <- ggplot(mtcars, aes(hp, mpg)) +
  geom_point(color = "steelblue", size = 2) +
  labs(title = "Simple scatter: HP vs MPG", x = "HP", y = "MPG")

# Stating units explicitly documents that width and height are in inches
ggsave("hp_vs_mpg.png", p, width = 6, height = 4, units = "in", dpi = 300)

If you're targeting vector formats for print or high-DPI screens, use PDF or SVG where appropriate. For the web and CSDN blog posts, a 1200–1600 px width PNG at 2x pixel density is usually crisp without being heavy.
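
Switching to a vector format is mostly a matter of changing the file extension, since ggsave() picks the graphics device from it. The sketch below writes a PDF into a temporary directory, which is an assumption made only to keep the example tidy; SVG works the same way but needs the svglite package installed.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(hp, mpg)) +
  geom_point(color = "steelblue", size = 2) +
  labs(title = "Simple scatter: HP vs MPG", x = "HP", y = "MPG")

# tempdir() avoids cluttering your project; use a real path in practice
out_pdf <- file.path(tempdir(), "hp_vs_mpg.pdf")

# The .pdf extension selects a vector device, so dpi is not needed here
ggsave(out_pdf, p, width = 6, height = 4, units = "in")
```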

Common mistakes and how to avoid them

Forgetting the + at the end of a line is the classic hiccup. Remember that ggplot layers chain with + and not with commas, and that the + must end a line rather than start the next one. Another common one is placing aesthetics outside aes() when you meant to map them. If your legend doesn't appear, check whether you set a constant instead of mapping a variable. If points look too dark or cluttered, use transparency via alpha, or jitter points slightly with geom_jitter() to reduce overlap at discrete x-values. When a bar chart errors on a y aesthetic or shows counts instead of your values, remember that geom_bar() counts rows by default; if you already have aggregated y values, use geom_col() instead.

# If you already computed y, use geom_col()
df <- tibble::tibble(group = c("A", "B", "C"), value = c(10, 7, 12))

ggplot(df, aes(group, value)) +
  geom_col()
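
The overplotting tip from the same list can be sketched in a similar way. The jitter width, the alpha value, and the seed below are illustrative assumptions; the one deliberate choice is height = 0, so that points move only sideways and the mpg values stay exact.

```r
library(ggplot2)

set.seed(42)  # jitter is random; a seed makes the figure reproducible

p_jitter <- ggplot(mtcars, aes(factor(cyl), mpg)) +
  # Spread points horizontally within each category, never vertically
  geom_jitter(width = 0.15, height = 0, alpha = 0.6, size = 2) +
  labs(x = "Cylinders", y = "Miles per gallon")

p_jitter
```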

Diagnostics like these are part of the craft. The more you practice, the more these fixes will be automatic muscle memory.

From plot to narrative

A graph by itself is helpful, but a graph embedded in a narrative is persuasive. When you write a report or a blog post, introduce the question, present the figure that directly addresses it, and interpret what the reader should take away. If there are caveats, state them plainly. If a pattern depends on a small sample or is sensitive to outliers, show robustness checks or alternative views. Visualizations are powerful because they compress information, but the responsibility to contextualize remains with you.

Where to go after this lesson

We've scratched the surface. You can do maps with geom_sf(), marginal plots with ggExtra, ridge plots for distributions over an ordered variable with ggridges, interactive variants with plotly or ggiraph, and multi-plot layouts with patchwork or cowplot. You don't need any of these to become productive, but it's nice to know that the ecosystem has your back when you need to level up.

Summary

You learned the core grammar that powers ggplot2 and practiced building and polishing plots that communicate real insights. You saw how to map variables to aesthetics, when to set constants, how to switch geoms, add statistical layers, transform scales, facet into small multiples, annotate key points, apply themes, and export images. The guiding principle is to think in layers: start with the bare minimum that answers a question, then add just enough structure and polish so that anyone can read your plot in a few seconds and learn something true about the data. Lesson 5 will take you into advanced data management techniques and some programming patterns that make repetitive tasks easier and safer.

Quiz (no answers)

Explain the difference between mapping color inside aes() and setting color outside of aes(). Provide one concise example of each and say when you would choose one over the other.
Create a scatter plot of wt vs mpg with a linear regression line and no standard error ribbon. Then facet the plot by cyl. Describe what you see.
When and why would you apply a log transform to an axis? Show how to log-transform the x-axis for a horsepower vs mpg plot without altering the underlying data.
What is the default behavior of geom_bar() and how does it differ from geom_col()? Write one short example where geom_col() is the correct choice.
Add an annotation to a plot that points to an outlier and includes text describing why it's notable. Mention which geoms or functions you used.
Describe a situation where facet_grid(rows ~ cols) communicates a pattern more effectively than facet_wrap(). Provide a code sketch using mtcars variables.
What are two visual decisions in the scales layer that can improve readability for your audience, and how would you implement them in code?

Quiz answers

Please leave a comment in the discussion area if you would like the answers to the quiz.