3 Data Wrangling in R

In the data sciences — psychology included — there is a stage between data collection and the communication of analytic results that consists of “transforming the data into a usable form, visualizing it, and analyzing it.” This transformation step is known as data wrangling. “Statistics” is often equated with “analysis,” but in practice data wrangling and visualization typically consume the bulk of the time and are among the most consequential parts of the workflow.

3.1 Introducing the tidyverse

This course uses the tidyverse for data wrangling. The tidyverse is both a design philosophy for working with data and a concrete package implementing that philosophy. Install the tidyverse package and load it as follows:

pacman::p_load(tidyverse)

You will see “Attaching core tidyverse packages,” with check marks next to several package names. The tidyverse is a meta-package that bundles a number of sub-packages: dplyr and tidyr handle reshaping; readr handles file I/O; forcats handles factor variables; stringr handles strings; lubridate handles dates; tibble provides a modern data-frame variant; purrr provides functional-programming utilities; and ggplot2 handles visualization.

You will also see a message about Conflicts. This warning, which appears for many packages and not only the tidyverse, signals “function-name collisions.” Up to now you have used base functions such as sqrt() and mean() without any explicit library() call — these are loaded automatically at R startup from the base package. When a subsequently loaded package defines a function of the same name as an already-loaded one, the new definition shadows the old one. The conflict message lists those shadowings. For instance, dplyr::filter() masks stats::filter() means that dplyr’s filter() (loaded as part of the tidyverse) takes precedence over stats’s filter(), which was previously available.

Such name collisions can be confusing. When you need to disambiguate, write package::function — for instance, stats::filter() — as the warning message itself illustrates.

3.2 The pipe operator

Next, the pipe operator. The pipe was introduced by the magrittr package (since included in the tidyverse) and revolutionized data wrangling in R. As of R 4.2, a native pipe is built into the language itself, with no package required. We distinguish the built-in operator as the native pipe when needed.

Let us see why this operator is so useful. The following script computes the (uncorrected) standard deviation of a small dataset.¹ In equations, where \(\bar{x}\) is the arithmetic mean of \(x\): \[v = \sqrt{\frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})^2}\]

dat <- c(10, 13, 15, 12, 14) # data
M <- mean(dat) # mean
dev <- dat - M # deviations
pow <- dev^2 # squared deviations
variance <- mean(pow) # mean of squared deviations is the variance
standardDev <- sqrt(variance) # square root of variance is the standard deviation

To arrive at standardDev, we introduced four intermediate objects: M, dev, pow, and variance. The objects being created sit on the left of <-, and the operations applied to them sit on the right; mentally, the reader processes each line as “create this object, by performing this computation.”

The pipe operator makes that flow explicit. Written %>%, it passes the result of its left-hand operand as the first argument of the function on its right. Rewritten with pipes, the calculation above becomes (note that Ctrl/Cmd+Shift+M inserts %>%):

dat <- c(10, 13, 15, 12, 14)
standardDev <- dat %>%
  {
    . - mean(.)
  } %>%
  {
    .^2
  } %>%
  mean() %>%
  sqrt()

The period (.) is the placeholder that refers to whatever was passed in by the previous stage; the second line is therefore {dat - mean(dat)}, the deviations from the mean. The subsequent stages square the deviations, take the mean, and take the square root. The placeholder is omitted where its position is unambiguous (e.g., as the first argument to mean() and sqrt()).

Reading the piped version, the script flow — data → deviations → squared → mean → square root — matches the conceptual flow of the calculation, making it easier to follow.

The same computation can also be written in nested form:

standardDev <- sqrt(mean((dat - mean(dat))^2))

This style — \(y = h(g(f(x)))\) — requires the reader to peel back the parentheses from the inside out, reversing the natural reading order. The pipe version, by contrast, reads x %>% f() %>% g() %>% h() -> y, in the same order as the underlying thought.

From here on we use the pipe extensively. Get comfortable with it (and with its keyboard shortcut).

3.3 Exercises 1. The pipe operator

Confirm via the help system that sqrt() and mean() belong to the base package. Where in the help page is this information shown? What about filter() and lag()?
After loading the tidyverse, dplyr::filter() takes precedence over stats::filter(). View the help for dplyr::filter().
View the help for stats::filter(), the function that has been shadowed.
Using the data above, compute the mean absolute deviation (MeanAD) and the median absolute deviation (MAD) with the pipe. R’s absolute-value function is abs(). The definitions are:

\[MeanAD = \frac{1}{n}\sum_{i=1}^n|x_i - \bar{x}|\] \[MAD = \mathrm{median}(|x_1-\mathrm{median}(x)|,\cdots,|x_n-\mathrm{median}(x)|)\]

3.4 Column and row selection

Now to more substantive data wrangling with the tidyverse. We begin by selecting subsets of columns and rows — a basic operation for narrowing the data to which an analysis is applied.

3.4.1 Column selection

Column selection is performed by select(), from dplyr in the tidyverse. Note that several other packages (notably MASS) define a function of the same name, so be alert for conflicts.

We illustrate with iris, a sample dataset included in base R. Since iris has 150 rows, we use head() to show only the first few rows in the snippets that follow; in your own exercises head() is not strictly required.

# inspect the iris dataset
iris %>% head()

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

# extract a subset of variables
iris %>%
  select(Sepal.Length, Species) %>%
  head()

  Sepal.Length Species
1          5.1  setosa
2          4.9  setosa
3          4.7  setosa
4          4.6  setosa
5          5.0  setosa
6          5.4  setosa

A minus sign drops variables:

iris %>%
  select(-Species) %>%
  head()

  Sepal.Length Sepal.Width Petal.Length Petal.Width
1          5.1         3.5          1.4         0.2
2          4.9         3.0          1.4         0.2
3          4.7         3.2          1.3         0.2
4          4.6         3.1          1.5         0.2
5          5.0         3.6          1.4         0.2
6          5.4         3.9          1.7         0.4

# dropping several variables
iris %>%
  select(-c(Petal.Length, Petal.Width)) %>%
  head()

  Sepal.Length Sepal.Width Species
1          5.1         3.5  setosa
2          4.9         3.0  setosa
3          4.7         3.2  setosa
4          4.6         3.1  setosa
5          5.0         3.6  setosa
6          5.4         3.9  setosa

select() also accepts pattern-matching helpers:

starts_with()
ends_with()
contains()
matches()

Examples:

# variables starting with a given string
iris %>%
  select(starts_with("Petal")) %>%
  head()

  Petal.Length Petal.Width
1          1.4         0.2
2          1.4         0.2
3          1.3         0.2
4          1.5         0.2
5          1.4         0.2
6          1.7         0.4

# variables ending with a given string
iris %>%
  select(ends_with("Length")) %>%
  head()

  Sepal.Length Petal.Length
1          5.1          1.4
2          4.9          1.4
3          4.7          1.3
4          4.6          1.5
5          5.0          1.4
6          5.4          1.7

# variables containing a substring
iris %>%
  select(contains("etal")) %>%
  head()

  Petal.Length Petal.Width
1          1.4         0.2
2          1.4         0.2
3          1.3         0.2
4          1.5         0.2
5          1.4         0.2
6          1.7         0.4

# selection by regular expression
iris %>%
  select(matches(".t.")) %>%
  head()

  Sepal.Length Sepal.Width Petal.Length Petal.Width
1          5.1         3.5          1.4         0.2
2          4.9         3.0          1.4         0.2
3          4.7         3.2          1.3         0.2
4          4.6         3.1          1.5         0.2
5          5.0         3.6          1.4         0.2
6          5.4         3.9          1.7         0.4

Regular expressions are a notation for specifying string patterns, common to many programming languages and even widely used in bibliographic search systems. Patterns are constructed from ordinary characters together with metacharacters denoting “any character,” “start of string,” “end of string,” and so on. A Web search for “regular expression” will surface many tutorials.

3.4.2 Row selection

If columns of a data frame correspond to variables, then rows correspond to observations (cases). Row selection is performed by dplyr::filter().

# rows where Sepal.Length exceeds 6
iris %>%
  filter(Sepal.Length > 6) %>%
  head()

  Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
1          7.0         3.2          4.7         1.4 versicolor
2          6.4         3.2          4.5         1.5 versicolor
3          6.9         3.1          4.9         1.5 versicolor
4          6.5         2.8          4.6         1.5 versicolor
5          6.3         3.3          4.7         1.6 versicolor
6          6.6         2.9          4.6         1.3 versicolor

# rows of a particular species
iris %>%
  filter(Species == "versicolor") %>%
  head()

  Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
1          7.0         3.2          4.7         1.4 versicolor
2          6.4         3.2          4.5         1.5 versicolor
3          6.9         3.1          4.9         1.5 versicolor
4          5.5         2.3          4.0         1.3 versicolor
5          6.5         2.8          4.6         1.5 versicolor
6          5.7         2.8          4.5         1.3 versicolor

# combining conditions
iris %>%
  filter(Species != "versicolor", Sepal.Length > 6) %>%
  head()

  Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
1          6.3         3.3          6.0         2.5 virginica
2          7.1         3.0          5.9         2.1 virginica
3          6.3         2.9          5.6         1.8 virginica
4          6.5         3.0          5.8         2.2 virginica
5          7.6         3.0          6.6         2.1 virginica
6          7.3         2.9          6.3         1.8 virginica

== is the equality operator: a single = would be interpreted as assignment, so equality testing requires the doubled form. Likewise, != is the inequality operator, true when the operands differ.

3.5 Creating and reassigning variables

Creating new variables from existing ones — or reassigning values — is among the most common operations in data wrangling. One may, for instance, dichotomize a continuous variable at some threshold to produce a categorical “high/low” variable, or apply a linear transformation to change units. Such variable manipulations — “engineering features from existing variables” — are performed with dplyr::mutate(). An example:

mutate(iris, Twice = Sepal.Length * 2) %>% head()

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species Twice
1          5.1         3.5          1.4         0.2  setosa  10.2
2          4.9         3.0          1.4         0.2  setosa   9.8
3          4.7         3.2          1.3         0.2  setosa   9.4
4          4.6         3.1          1.5         0.2  setosa   9.2
5          5.0         3.6          1.4         0.2  setosa  10.0
6          5.4         3.9          1.7         0.4  setosa  10.8

A new variable Twice has been created. mutate() is typically used inside a pipe (in fact that is its primary mode of use). The next example partitions Sepal.Length into “high” and “low” groups at the mean:

iris %>%
  select(Sepal.Length) %>%
  mutate(Sepal.HL = ifelse(Sepal.Length > mean(Sepal.Length), 1, 2)) %>%
  mutate(Sepal.HL = factor(Sepal.HL, label = c("High", "Low"))) %>%
  head()

  Sepal.Length Sepal.HL
1          5.1      Low
2          4.9      Low
3          4.7      Low
4          4.6      Low
5          5.0      Low
6          5.4      Low

ifelse(condition, value_if_true, value_if_false) is a conditional that returns the second argument when the condition holds and the third otherwise. Here we return 1 when above the mean and 2 otherwise, assign the result to a new variable Sepal.HL via mutate(), and then in a second mutate() overwrite that variable with a factor version of itself. Assigning back to the same variable name is the idiom for type conversion as well (character → numeric, numeric → factor, and so on).

3.6 Exercises 2. select, filter, mutate

Read Baseball.csv into a data frame named df.
The data frame df contains many variables. names(df) returns the variable names. List them.
From df, retain only Year (year), Name (player name), team (team), height, weight, salary, and position. Assign the result to df2.
df2 covers several seasons; we want only the 2020 data. Filter accordingly.
Now filter further to keep only the 2020 Hanshin Tigers records.
Conversely, filter to the 2020 data excluding the Hanshin Tigers.
Create a body-mass-index variable, BMI = weight (kg) / height (m)². Note that height is in centimetres.
Create a new factor variable position2 distinguishing pitchers from fielders. Fielders are everyone who is not a pitcher (infielders, outfielders, and catchers).
Japanese professional baseball is divided into the Central League and the Pacific League. The Central League teams are Giants, Carp, Tigers, Swallows, Dragons, and DeNA; the Pacific League is everything else. Add a factor variable League to df2 accordingly.
The Year variable is stored as a string because each value ends in 年度 (“season”). Strip that suffix and convert the variable to numeric.

3.7 Long and wide formats

So far the data we have seen have been stored as a two-dimensional array of cases × variables. This layout is convenient for humans but not always for machines. Indeed, spreadsheet software is sometimes (in)famously mis-used in Japan in a way nicknamed “kami Excel” (god-like Excel), in which a spreadsheet is treated as graph paper or manuscript-style ruled paper. Visually intuitive, perhaps; but the data become structurally opaque to a computer and very difficult to analyze. Many such datasets are still in circulation.

In response, in December 2020 Japan’s Ministry of Internal Affairs and Communications issued a unified set of rules for machine-readable data layout (総務省 2020), including a checklist:

Is the file format Excel or CSV?
Is there one datum per cell?
Are numeric data stored as numeric (no embedded strings)?
Are there no merged cells?
Are spaces and line breaks not used for cosmetic layout?
Are no column headings omitted?
Are formulas converted to their numeric results?
Are no embedded objects (images, etc.) used?
Are units of measurement recorded?
Are no platform-specific characters used?
Is the data not split into fragments?
Is there only one table per sheet?

The basic rule is: build a single self-contained dataset with one case per row, with neither omissions nor extras.

Independently, Wickham (2014) proposed the notion of tidy data as a canonical form for analysis. Tidy data have four properties:

Each variable forms one column.
Each observation forms one row.
Each type of observational unit forms one table.
Each value forms one cell.

In tidy form, the correspondence between variables and values is unambiguous to a computer and the data are straightforward to analyze. It is no exaggeration to say that the goal of data wrangling is to take messy, irregular data and put them into tidy form.

A subtle but important observation: variable names themselves can also be regarded as a variable. Consider a matrix-style table of the form:

Wide-format data
	Morning	Afternoon	Evening	Night
Tokyo	clear	clear	rain	rain
Osaka	clear	cloudy	clear	clear
Fukuoka	clear	cloudy	cloudy	rain

To read the evening weather in Osaka, one finds the Osaka row and the Evening column. That is, locating a single cell requires referencing both a row label and a column label.

The same data can be rearranged like this:

Long-format data
Region	Time	Weather
Tokyo	Morning	clear
Tokyo	Afternoon	clear
Tokyo	Evening	rain
Tokyo	Night	rain
Osaka	Morning	clear
Osaka	Afternoon	cloudy
Osaka	Evening	clear
Osaka	Night	clear
Fukuoka	Morning	clear
Fukuoka	Afternoon	cloudy
Fukuoka	Evening	cloudy
Fukuoka	Night	rain

The information content is identical, but locating the “Osaka, evening” record requires only row selection — easier for a machine to handle. This second layout is the long format (or “vertical” layout); the original is the wide format (“horizontal” layout).

One advantage of the long format is the handling of missing values. In a wide table, dropping a whole row or column when only one cell is missing is wasteful; targeting both a row and a column for partial deletion is technically awkward. In a long table, the missing record is a single row to be dropped.

The tidyverse (specifically tidyr) provides functions for moving between long and wide layouts. First, pivot_longer() converts wide to long:

iris %>% pivot_longer(-Species)

# A tibble: 600 × 3
   Species name         value
   <fct>   <chr>        <dbl>
 1 setosa  Sepal.Length   5.1
 2 setosa  Sepal.Width    3.5
 3 setosa  Petal.Length   1.4
 4 setosa  Petal.Width    0.2
 5 setosa  Sepal.Length   4.9
 6 setosa  Sepal.Width    3  
 7 setosa  Petal.Length   1.4
 8 setosa  Petal.Width    0.2
 9 setosa  Sepal.Length   4.7
10 setosa  Sepal.Width    3.2
# ℹ 590 more rows

Here we pivot the iris data around Species: all other variables are stacked into a name/value pair.

Conversely, pivot_wider() reshapes long back to wide:

iris %>%
  select(-Species) %>%
  rowid_to_column("ID") %>%
  pivot_longer(-ID) %>%
  pivot_wider(id_cols = ID, names_from = name, values_from = value)

# A tibble: 150 × 5
      ID Sepal.Length Sepal.Width Petal.Length Petal.Width
   <int>        <dbl>       <dbl>        <dbl>       <dbl>
 1     1          5.1         3.5          1.4         0.2
 2     2          4.9         3            1.4         0.2
 3     3          4.7         3.2          1.3         0.2
 4     4          4.6         3.1          1.5         0.2
 5     5          5           3.6          1.4         0.2
 6     6          5.4         3.9          1.7         0.4
 7     7          4.6         3.4          1.4         0.3
 8     8          5           3.4          1.5         0.2
 9     9          4.4         2.9          1.4         0.2
10    10          4.9         3.1          1.5         0.1
# ℹ 140 more rows

Here Species is dropped and a per-row ID column is added. With ID as the key, the variable names come from name and the values from value, restoring the wide layout.²

3.8 Grouping and summary statistics

Once data are in long format, filtering by variable or case is easy. To compute summary statistics for each group of a categorical variable, combine group_by() with summarize() (or reframe()). An example:

iris %>% group_by(Species)

# A tibble: 150 × 5
# Groups:   Species [3]
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
          <dbl>       <dbl>        <dbl>       <dbl> <fct>  
 1          5.1         3.5          1.4         0.2 setosa 
 2          4.9         3            1.4         0.2 setosa 
 3          4.7         3.2          1.3         0.2 setosa 
 4          4.6         3.1          1.5         0.2 setosa 
 5          5           3.6          1.4         0.2 setosa 
 6          5.4         3.9          1.7         0.4 setosa 
 7          4.6         3.4          1.4         0.3 setosa 
 8          5           3.4          1.5         0.2 setosa 
 9          4.4         2.9          1.4         0.2 setosa 
10          4.9         3.1          1.5         0.1 setosa 
# ℹ 140 more rows

The displayed contents look unchanged, but the output now shows Species[3], indicating that the data are grouped into three levels of Species. Now summarize():

iris %>%
  group_by(Species) %>%
  summarize(
    n = n(),
    Mean = mean(Sepal.Length),
    Max = max(Sepal.Length),
    IQR = IQR(Sepal.Length)
  )

# A tibble: 3 × 5
  Species        n  Mean   Max   IQR
  <fct>      <int> <dbl> <dbl> <dbl>
1 setosa        50  5.01   5.8 0.400
2 versicolor    50  5.94   7   0.7  
3 virginica     50  6.59   7.9 0.675

We compute the sample size (n), mean, maximum, and interquartile range (IQR).³

The example above summarizes only Sepal.Length. To apply the same calculation to multiple numeric variables at once, use across():

iris %>%
  group_by(Species) %>%
  summarize(across(
    c(Sepal.Length, Sepal.Width, Petal.Length),
    ~ mean(.x)
  ))

# A tibble: 3 × 4
  Species    Sepal.Length Sepal.Width Petal.Length
  <fct>             <dbl>       <dbl>        <dbl>
1 setosa             5.01        3.43         1.46
2 versicolor         5.94        2.77         4.26
3 virginica          6.59        2.97         5.55

A note on the ~ mean(.x) syntax. Expressions beginning with a tilde (~) are lambda functions (or lambda expressions) in R — a compact way to define an anonymous function on the fly. The same thing can be written with an explicit function:

iris %>%
  group_by(Species) %>%
  summarize(across(
    c(Sepal.Length, Sepal.Width, Petal.Length),
    function(x) {
      mean(x)
    }
  ))

# A tibble: 3 × 4
  Species    Sepal.Length Sepal.Width Petal.Length
  <fct>             <dbl>       <dbl>        <dbl>
1 setosa             5.01        3.43         1.46
2 versicolor         5.94        2.77         4.26
3 virginica          6.59        2.97         5.55

We will return to functions and lambdas in detail later; for now, focus on the pattern for applying a function across multiple variables. The variable-selection helpers introduced with select() (starts_with(), etc.) also work inside across(). Here is an example combining multi-variable selection with multi-function application; multiple functions are provided as a named list of lambdas.

iris %>%
  group_by(Species) %>%
  summarize(across(starts_with("Sepal"),
    .fns = list(
      M = ~ mean(.x),
      Q1 = ~ quantile(.x, 0.25),
      Q3 = ~ quantile(.x, 0.75)
    )
  ))

# A tibble: 3 × 7
  Species    Sepal.Length_M Sepal.Length_Q1 Sepal.Length_Q3 Sepal.Width_M
  <fct>               <dbl>           <dbl>           <dbl>         <dbl>
1 setosa               5.01            4.8              5.2          3.43
2 versicolor           5.94            5.6              6.3          2.77
3 virginica            6.59            6.22             6.9          2.97
# ℹ 2 more variables: Sepal.Width_Q1 <dbl>, Sepal.Width_Q3 <dbl>

3.9 Exercises 3. Reshaping data

Use the df2 object built above. If it is no longer in your environment, return to the previous exercises and recreate it.
Group by year and compute, for each year, the number of registered players (count of rows) and the mean salary.
Group by year and team and compute the same two statistics for each year–team combination.
Convert the resulting summary to wide format with pivot_wider(), with one row per year and one column per team-variable combination.
Convert the wide-format result back to long format using pivot_longer() with Year as the key.

Wickham, Hadley. 2014. “Tidy Data.” Journal of Statistical Software 59: 1–23. https://doi.org/10.18637/jss.v059.i10.

総務省. 2020. 統計表における機械判別可能なデータ作成に関する表記方法. 統計企画会議申し合わせ. https://www.soumu.go.jp/main_content/000723697.pdf.

Of course one could simply call sd(dat); we expand each step here for illustration. Note also that sd() computes the unbiased (n−1) standard deviation, which differs from the maximum-likelihood (sample) standard deviation shown here.↩︎
Species is dropped because it would not serve as a useful identifier for pivoting back (it has only three levels), and a separate row-level identifier is needed. Note also that Species information is lost in the round-trip: the long-format value column cannot simultaneously hold both character and numeric data. A workaround is to numerically encode the factor with as.numeric() before pivoting.↩︎
The interquartile range is the difference between the 75th and 25th percentiles.↩︎