Goal: Learn how to manipulate tibbles, the tidyverse version of data frames, using the dplyr package

Prerequisites: You should be familiar with these aspects of R: console, packages, data frames, and vectors. It is helpful, but not necessary, to have previous programming experience.

1 Introduction

The tidyverse is a collection of R packages designed for data science. The tidyverse packages use the same grammar and data structures, making it easy to combine functions from different packages.

The core tidyverse packages:

  • tibble: for tibbles, the tidyverse version of data frames
  • dplyr: for data manipulation
  • readr: for data import
  • tidyr: for data tidying
  • stringr: for string manipulation
  • purrr: for functional programming
  • forcats: for factors
  • ggplot2: for data visualization

This quickstart guide focuses on the tibble and dplyr packages, and provides the essential functions to get started on data manipulation. It is based off Chapter 5 ‘Data transformation’ of R for Data Science, by Garrett Grolemund and Hadley Wickham.

 

2 Getting started

2.1 Running R code

There are two types of code blocks in this guide.

# This is an example of R code.
1 + 2
# This is an example of output.
3

The first type of block, with a gray background, is R code. You can copy and run such code blocks in the console of your R editor. You can also download the quickstart-companion R script, which contains all of the R code in this guide.

The second type of block, with a white background, is the output of running whatever R code is above it.

2.2 Installation

Install and load the tidyverse:

# Install the tidyverse
install.packages("tidyverse")

# Load the tidyverse
library(tidyverse)

Install and load the nycflights13 package, which is a data set that contains information on flights that departed from New York City in 2013. We will use the nycflights13::flights data frame to explore the tibble and dplyr packages.

install.packages("nycflights13")
library(nycflights13)

 

3 tibble

The tbl_df class, or “tibble”, is the most important part of the tibble package. Tibbles are the tidyverse version of the base R data.frame class, and tidyverse functions are designed to work with tibbles.

A tibble is stricter than a data.frame. For example, subsetting a tibble always returns another tibble. In comparison, subsetting a data.frame sometimes returns a data.frame and sometimes returns a vector.

This guide uses the terms “tibble” and “data frame” interchangeably.

Learn more about tibbles by running vignette("tibble") in your console, or by reading the CRAN documentation on tibbles.

3.1 Printing a tibble

Print the nycflights13::flights, or simply flights, data frame:

flights
# A tibble: 336,776 x 19
    year month   day dep_time sched_dep_time dep_delay arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>
 1  2013     1     1      517            515         2      830
 2  2013     1     1      533            529         4      850
 3  2013     1     1      542            540         2      923
 4  2013     1     1      544            545        -1     1004
 5  2013     1     1      554            600        -6      812
 6  2013     1     1      554            558        -4      740
 7  2013     1     1      555            600        -5      913
 8  2013     1     1      557            600        -3      709
 9  2013     1     1      557            600        -3      838
10  2013     1     1      558            600        -2      753
# … with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
#   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
#   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#   minute <dbl>, time_hour <dttm>

When a tibble is printed, it displays the first 10 rows and as many columns as will fit on the console. At the top, it shows the total number of rows and columns, and provides the type of each column (also called a “variable”). Tibbles use different font styles and colors to style columns and make certain values more visually distinct. This formatting is not reproduced in the output code block above.

If you print flights in an R editor like R Studio, the tibble may look like Figure 1 below, where negative values (as in the dep_delay column) are printed in red, and meta data is printed in gray.

Formatted tibble
Figure 1: flights printed in R Studio

 

4 dplyr

All examples in this guide use the flights data frame. Run ?flights in your console to read its documentation.

dplyr is a collection of functions for manipulating data frames.

Important dplyr functions:

  • select(): subsets a data frame by picking columns
  • filter(): subsets a data frame by picking rows
  • mutate(): creates new columns or modifies existing columns
  • arrange(): reorder a data frame by row values
  • summarize(): summarize a data frame
  • group_by(): groups a data frame by row vaues

These functions have the same structure:

  • Input: the first argument is a data frame; all other arguments describe what to do with the data frame
  • Output: a new data frame

Since dplyr functions return a new data frame, and do not overwrite the original data frame, use the assignment operator <- to save the output of a dplyr function:

flights

flights_modified <- some_dplyr_function(flights, argument2, argument3)

4.1 Select columns with select()

select() subsets a data frame by columns.

  • First argument: tibble (tbl.df)
  • Subsequent arguments: column names without quotation marks

Columns that are explicitly named are kept, and all other columns are dropped.

4.1.1 Examples of select()

# Select columns by name
select(flights, year, month, day, dep_delay, arr_delay)
# A tibble: 336,776 x 5
    year month   day dep_delay arr_delay
   <int> <int> <int>     <dbl>     <dbl>
 1  2013     1     1         2        11
 2  2013     1     1         4        20
 3  2013     1     1         2        33
 4  2013     1     1        -1       -18
 5  2013     1     1        -6       -25
 6  2013     1     1        -4        12
 7  2013     1     1        -5        19
 8  2013     1     1        -3       -14
 9  2013     1     1        -3        -8
10  2013     1     1        -2         8
# … with 336,766 more rows
# Select columns from `year` to `arr_time` (inclusive)
select(flights, year:arr_time)
# A tibble: 336,776 x 7
    year month   day dep_time sched_dep_time dep_delay arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>
 1  2013     1     1      517            515         2      830
 2  2013     1     1      533            529         4      850
 3  2013     1     1      542            540         2      923
 4  2013     1     1      544            545        -1     1004
 5  2013     1     1      554            600        -6      812
 6  2013     1     1      554            558        -4      740
 7  2013     1     1      555            600        -5      913
 8  2013     1     1      557            600        -3      709
 9  2013     1     1      557            600        -3      838
10  2013     1     1      558            600        -2      753
# … with 336,766 more rows
# Select all columns except the columns from `year` to `arr_time` (inclusive)
select(flights, -(year:arr_time))
# A tibble: 336,776 x 12
   sched_arr_time arr_delay carrier flight tailnum origin dest  air_time
            <int>     <dbl> <chr>    <int> <chr>   <chr>  <chr>    <dbl>
 1            819        11 UA        1545 N14228  EWR    IAH        227
 2            830        20 UA        1714 N24211  LGA    IAH        227
 3            850        33 AA        1141 N619AA  JFK    MIA        160
 4           1022       -18 B6         725 N804JB  JFK    BQN        183
 5            837       -25 DL         461 N668DN  LGA    ATL        116
 6            728        12 UA        1696 N39463  EWR    ORD        150
 7            854        19 B6         507 N516JB  EWR    FLL        158
 8            723       -14 EV        5708 N829AS  LGA    IAD         53
 9            846        -8 B6          79 N593JB  JFK    MCO        140
10            745         8 AA         301 N3ALAA  LGA    ORD        138
# … with 336,766 more rows, and 4 more variables: distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

4.1.2 select() helper functions

select() can take helper functions as arguments.

Examples of helper functions:

  • everything(): selects all columns that have not already been selected
  • contains("abc"): selects columns with a name that contains “abc”

4.1.2.1 Examples of select() helper functions

# Rearrange data frame by moving certain columns to the front and moving all other columns to the back
select(flights, dep_delay, arr_delay, everything())
# A tibble: 336,776 x 19
   dep_delay arr_delay  year month   day dep_time sched_dep_time arr_time
       <dbl>     <dbl> <int> <int> <int>    <int>          <int>    <int>
 1         2        11  2013     1     1      517            515      830
 2         4        20  2013     1     1      533            529      850
 3         2        33  2013     1     1      542            540      923
 4        -1       -18  2013     1     1      544            545     1004
 5        -6       -25  2013     1     1      554            600      812
 6        -4        12  2013     1     1      554            558      740
 7        -5        19  2013     1     1      555            600      913
 8        -3       -14  2013     1     1      557            600      709
 9        -3        -8  2013     1     1      557            600      838
10        -2         8  2013     1     1      558            600      753
# … with 336,766 more rows, and 11 more variables: sched_arr_time <int>,
#   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
#   time_hour <dttm>
# Select all columns with a name that contains the string "time"
select(flights, contains("time"))
# A tibble: 336,776 x 6
   dep_time sched_dep_time arr_time sched_arr_time air_time
      <int>          <int>    <int>          <int>    <dbl>
 1      517            515      830            819      227
 2      533            529      850            830      227
 3      542            540      923            850      160
 4      544            545     1004           1022      183
 5      554            600      812            837      116
 6      554            558      740            728      150
 7      555            600      913            854      158
 8      557            600      709            723       53
 9      557            600      838            846      140
10      558            600      753            745      138
# … with 336,766 more rows, and 1 more variable: time_hour <dttm>

4.1.3 Exercises

  1. Use select() on flights to move the year column to the back.
    Click for Answer
     select(flights, -year, year)
     # A tibble: 336,776 x 19
        month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
        <int> <int>    <int>          <int>     <dbl>    <int>          <int>
      1     1     1      517            515         2      830            819
      2     1     1      533            529         4      850            830
      3     1     1      542            540         2      923            850
      4     1     1      544            545        -1     1004           1022
      5     1     1      554            600        -6      812            837
      6     1     1      554            558        -4      740            728
      7     1     1      555            600        -5      913            854
      8     1     1      557            600        -3      709            723
      9     1     1      557            600        -3      838            846
     10     1     1      558            600        -2      753            745
     # … with 336,766 more rows, and 12 more variables: arr_delay <dbl>,
     #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
     #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
     #   time_hour <dttm>, year <int>
  2. Use select() on flights to select the first three columns, and any column that contains the string "arr".
    Click for Answer
     select(flights, year:day, contains("arr"))
     # A tibble: 336,776 x 7
         year month   day arr_time sched_arr_time arr_delay carrier
        <int> <int> <int>    <int>          <int>     <dbl> <chr>  
      1  2013     1     1      830            819        11 UA     
      2  2013     1     1      850            830        20 UA     
      3  2013     1     1      923            850        33 AA     
      4  2013     1     1     1004           1022       -18 B6     
      5  2013     1     1      812            837       -25 DL     
      6  2013     1     1      740            728        12 UA     
      7  2013     1     1      913            854        19 B6     
      8  2013     1     1      709            723       -14 EV     
      9  2013     1     1      838            846        -8 B6     
     10  2013     1     1      753            745         8 AA     
     # … with 336,766 more rows

4.2 Filter rows with filter()

filter() subsets a data frame by rows.

  • First argument: tibble (tbl.df)
  • Subsequent arguments: logical predicates

The comparison operators >, >=, <, <=, !=, and == create predicates, and the logical operators & (AND), | (OR), and ! (NOT) join predicates.

Rows where the predicate evaluates to TRUE are kept, and all other rows are dropped.

4.2.1 Examples of filter()

# Pick out flights that departed on January 1st
filter(flights, month == 1 & day == 1)
# A tibble: 842 x 19
    year month   day dep_time sched_dep_time dep_delay arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>
 1  2013     1     1      517            515         2      830
 2  2013     1     1      533            529         4      850
 3  2013     1     1      542            540         2      923
 4  2013     1     1      544            545        -1     1004
 5  2013     1     1      554            600        -6      812
 6  2013     1     1      554            558        -4      740
 7  2013     1     1      555            600        -5      913
 8  2013     1     1      557            600        -3      709
 9  2013     1     1      557            600        -3      838
10  2013     1     1      558            600        -2      753
# … with 832 more rows, and 12 more variables: sched_arr_time <int>,
#   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
#   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#   minute <dbl>, time_hour <dttm>
# Pick out flights that departed in January, but not flights that departed on January 1st
filter(flights, month == 1 & day != 1)
# A tibble: 26,162 x 19
    year month   day dep_time sched_dep_time dep_delay arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>
 1  2013     1     2       42           2359        43      518
 2  2013     1     2      126           2250       156      233
 3  2013     1     2      458            500        -2      703
 4  2013     1     2      512            515        -3      809
 5  2013     1     2      535            540        -5      831
 6  2013     1     2      536            529         7      840
 7  2013     1     2      539            545        -6      959
 8  2013     1     2      554            600        -6      845
 9  2013     1     2      554            600        -6      841
10  2013     1     2      554            600        -6      909
# … with 26,152 more rows, and 12 more variables: sched_arr_time <int>,
#   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
#   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#   minute <dbl>, time_hour <dttm>
# Pick out flights that departed in January or in December
filter(flights, month == 1 | month == 12)
# A tibble: 55,139 x 19
    year month   day dep_time sched_dep_time dep_delay arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>
 1  2013     1     1      517            515         2      830
 2  2013     1     1      533            529         4      850
 3  2013     1     1      542            540         2      923
 4  2013     1     1      544            545        -1     1004
 5  2013     1     1      554            600        -6      812
 6  2013     1     1      554            558        -4      740
 7  2013     1     1      555            600        -5      913
 8  2013     1     1      557            600        -3      709
 9  2013     1     1      557            600        -3      838
10  2013     1     1      558            600        -2      753
# … with 55,129 more rows, and 12 more variables: sched_arr_time <int>,
#   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
#   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#   minute <dbl>, time_hour <dttm>

When writing a predicate, pay attention to operator precedence. In particular, the comparison operators have higher precedence than the logical operators.

4.2.2 Exercises

For the following exercises, create a new data frame called flights2.

flights2 <- select(flights, tailnum, starts_with("dep_"), starts_with("arr_"), distance)
  1. Use filter() on flights2 to find all flights that traveled 2000 or more miles.
    Click for Answer
     filter(flights2, distance >= 2000)
     # A tibble: 51,695 x 6
        tailnum dep_time dep_delay arr_time arr_delay distance
        <chr>      <int>     <dbl>    <int>     <dbl>    <dbl>
      1 N29129       558        -2      924         7     2475
      2 N53441       558        -2      923       -14     2565
      3 N76515       559        -1      854        -8     2227
      4 N532UA       611        11      945        14     2586
      5 N807AW       622        -8     1017         3     2133
      6 N535UW       627        -3     1018         0     2153
      7 N33289       628        -2     1016        29     2454
      8 N38727       646         1     1023        -7     2434
      9 N558JB       651        -4      936        -6     2248
     10 N705TW       655        -5     1037        -8     2586
     # … with 51,685 more rows
  2. Use filter() on flights2 to find all flights that departed early or more than 1 hour late.
    Click for Answer
     filter(flights2, dep_delay < 0 | dep_delay > 60)
     # A tibble: 210,156 x 6
        tailnum dep_time dep_delay arr_time arr_delay distance
        <chr>      <int>     <dbl>    <int>     <dbl>    <dbl>
      1 N804JB       544        -1     1004       -18     1576
      2 N668DN       554        -6      812       -25      762
      3 N39463       554        -4      740        12      719
      4 N516JB       555        -5      913        19     1065
      5 N829AS       557        -3      709       -14      229
      6 N593JB       557        -3      838        -8      944
      7 N3ALAA       558        -2      753         8      733
      8 N793JB       558        -2      849        -2     1028
      9 N657JB       558        -2      853        -3     1005
     10 N29129       558        -2      924         7     2475
     # … with 210,146 more rows

4.3 Create new columns with mutate()

mutate() creates new columns or modifies existing columns.

  • First argument: tibble (tbl.df)
  • Subsequent arguments: expressions of the form col_name = col_value.

If a column with the name col_name does not exist, then a new column is added to the end of the data frame. Its name is col_name, and its values are determined by col_value.

If a column with the name col_name already exists, then no new column is added and the col_name column keeps its original postion. All of the values in col_name are modified according to col_value.

4.3.1 Examples of mutate()

mutate() adds new columns to the end of a data frame, so newly created columns may not be displayed in the console. Create a narrower data frame called flights_narrow by using select():

flights_narrow <- select(flights, year:day, distance, air_time, contains("delay"))

The following examples use the flights_narrow data frame.

# Create a column called `tot_delay`. Each value is a sum of the`dep_delay` and `arr_delay` values in the same row.
mutate(flights_narrow, tot_delay = dep_delay + arr_delay)
# A tibble: 336,776 x 8
    year month   day distance air_time dep_delay arr_delay tot_delay
   <int> <int> <int>    <dbl>    <dbl>     <dbl>     <dbl>     <dbl>
 1  2013     1     1     1400      227         2        11        13
 2  2013     1     1     1416      227         4        20        24
 3  2013     1     1     1089      160         2        33        35
 4  2013     1     1     1576      183        -1       -18       -19
 5  2013     1     1      762      116        -6       -25       -31
 6  2013     1     1      719      150        -4        12         8
 7  2013     1     1     1065      158        -5        19        14
 8  2013     1     1      229       53        -3       -14       -17
 9  2013     1     1      944      140        -3        -8       -11
10  2013     1     1      733      138        -2         8         6
# … with 336,766 more rows
# Create multiple columns at once
mutate(flights_narrow, hours = air_time / 60, speed = distance / hours)
# A tibble: 336,776 x 9
    year month   day distance air_time dep_delay arr_delay hours speed
   <int> <int> <int>    <dbl>    <dbl>     <dbl>     <dbl> <dbl> <dbl>
 1  2013     1     1     1400      227         2        11 3.78   370.
 2  2013     1     1     1416      227         4        20 3.78   374.
 3  2013     1     1     1089      160         2        33 2.67   408.
 4  2013     1     1     1576      183        -1       -18 3.05   517.
 5  2013     1     1      762      116        -6       -25 1.93   394.
 6  2013     1     1      719      150        -4        12 2.5    288.
 7  2013     1     1     1065      158        -5        19 2.63   404.
 8  2013     1     1      229       53        -3       -14 0.883  259.
 9  2013     1     1      944      140        -3        -8 2.33   405.
10  2013     1     1      733      138        -2         8 2.3    319.
# … with 336,766 more rows
# Modify an existing column
mutate(flights_narrow, distance = 10 - 2)
# A tibble: 336,776 x 7
    year month   day distance air_time dep_delay arr_delay
   <int> <int> <int>    <dbl>    <dbl>     <dbl>     <dbl>
 1  2013     1     1        8      227         2        11
 2  2013     1     1        8      227         4        20
 3  2013     1     1        8      160         2        33
 4  2013     1     1        8      183        -1       -18
 5  2013     1     1        8      116        -6       -25
 6  2013     1     1        8      150        -4        12
 7  2013     1     1        8      158        -5        19
 8  2013     1     1        8       53        -3       -14
 9  2013     1     1        8      140        -3        -8
10  2013     1     1        8      138        -2         8
# … with 336,766 more rows

4.3.2 Vectorized functions

In the examples above, mutate() used the arithmetic operators (+, -, *, /) to create new columns. mutate() can use any vectorized operator or function to create a new column. A vectorized operator/function takes a vector of values as input and returns as ouput a vector of the same size.

Examples of vectorized operators/functions :

  • %/%: integer division operator
  • %/%: modulo operator
  • cumsum(x): calculates the cumulative sum of the elements in x

4.3.2.1 Examples of vectorized functions

# Create columns that separate `air_time` into hours and minutes
mutate(flights_narrow, at_hr = air_time %/% 60, at_min = air_time %% 60)
# A tibble: 336,776 x 9
    year month   day distance air_time dep_delay arr_delay at_hr at_min
   <int> <int> <int>    <dbl>    <dbl>     <dbl>     <dbl> <dbl>  <dbl>
 1  2013     1     1     1400      227         2        11     3     47
 2  2013     1     1     1416      227         4        20     3     47
 3  2013     1     1     1089      160         2        33     2     40
 4  2013     1     1     1576      183        -1       -18     3      3
 5  2013     1     1      762      116        -6       -25     1     56
 6  2013     1     1      719      150        -4        12     2     30
 7  2013     1     1     1065      158        -5        19     2     38
 8  2013     1     1      229       53        -3       -14     0     53
 9  2013     1     1      944      140        -3        -8     2     20
10  2013     1     1      733      138        -2         8     2     18
# … with 336,766 more rows
# Create a column that cumulatively sums the arrival delays
mutate(flights_narrow, cum_arr_delay = cumsum(arr_delay))
# A tibble: 336,776 x 8
    year month   day distance air_time dep_delay arr_delay cum_arr_delay
   <int> <int> <int>    <dbl>    <dbl>     <dbl>     <dbl>         <dbl>
 1  2013     1     1     1400      227         2        11            11
 2  2013     1     1     1416      227         4        20            31
 3  2013     1     1     1089      160         2        33            64
 4  2013     1     1     1576      183        -1       -18            46
 5  2013     1     1      762      116        -6       -25            21
 6  2013     1     1      719      150        -4        12            33
 7  2013     1     1     1065      158        -5        19            52
 8  2013     1     1      229       53        -3       -14            38
 9  2013     1     1      944      140        -3        -8            30
10  2013     1     1      733      138        -2         8            38
# … with 336,766 more rows

4.3.3 Exercises

  1. Use select() on flights to create a new data frame called flights_c that contains only character type columns (Hint: use glimpse(flights)). Then, use mutate() on flights_c to create a column called trip. If a plane flies from EWR to IAH, the value of trip should be "EWR to IAH".
    Click for Answer
     flights_c <- select(flights, carrier, tailnum, origin, dest)
     mutate(flights_c, trip = str_c(origin, " to ", dest))
     # A tibble: 336,776 x 5
        carrier tailnum origin dest  trip      
        <chr>   <chr>   <chr>  <chr> <chr>     
      1 UA      N14228  EWR    IAH   EWR to IAH
      2 UA      N24211  LGA    IAH   LGA to IAH
      3 AA      N619AA  JFK    MIA   JFK to MIA
      4 B6      N804JB  JFK    BQN   JFK to BQN
      5 DL      N668DN  LGA    ATL   LGA to ATL
      6 UA      N39463  EWR    ORD   EWR to ORD
      7 B6      N516JB  EWR    FLL   EWR to FLL
      8 EV      N829AS  LGA    IAD   LGA to IAD
      9 B6      N593JB  JFK    MCO   JFK to MCO
     10 AA      N3ALAA  LGA    ORD   LGA to ORD
     # … with 336,766 more rows
     # Alternatively:
     # mutate(flights_c, trip = str_c(origin, dest, sep = " to "))
  2. Use mutate() on flights_narrow to create a column called tot_delay, which calculates the total delay of each flight. Then, create a column called prop_dep_delay, which calculates the proportion of each flight’s departure delay to its total delay.
    Click for Answer
     mutate(flights_narrow, tot_delay = dep_delay + arr_delay,
            prop_dep_delay = dep_delay / tot_delay)
     # A tibble: 336,776 x 9
         year month   day distance air_time dep_delay arr_delay tot_delay
        <int> <int> <int>    <dbl>    <dbl>     <dbl>     <dbl>     <dbl>
      1  2013     1     1     1400      227         2        11        13
      2  2013     1     1     1416      227         4        20        24
      3  2013     1     1     1089      160         2        33        35
      4  2013     1     1     1576      183        -1       -18       -19
      5  2013     1     1      762      116        -6       -25       -31
      6  2013     1     1      719      150        -4        12         8
      7  2013     1     1     1065      158        -5        19        14
      8  2013     1     1      229       53        -3       -14       -17
      9  2013     1     1      944      140        -3        -8       -11
     10  2013     1     1      733      138        -2         8         6
     # … with 336,766 more rows, and 1 more variable: prop_dep_delay <dbl>

4.4 Reorder rows with arrange()

arrange() reorders the rows of a data frame.

  • First argument: tibble (tbl.df)
  • Subsequent arguments: column names without quotation marks

Values in an argument column are sorted in ascending order by default. Wrap a column name in desc() to sort in descending order.

4.4.1 Examples of arrange()

# Order flights from least departure delay to most
arrange(flights, dep_delay)
# A tibble: 336,776 x 19
    year month   day dep_time sched_dep_time dep_delay arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>
 1  2013    12     7     2040           2123       -43       40
 2  2013     2     3     2022           2055       -33     2240
 3  2013    11    10     1408           1440       -32     1549
 4  2013     1    11     1900           1930       -30     2233
 5  2013     1    29     1703           1730       -27     1947
 6  2013     8     9      729            755       -26     1002
 7  2013    10    23     1907           1932       -25     2143
 8  2013     3    30     2030           2055       -25     2213
 9  2013     3     2     1431           1455       -24     1601
10  2013     5     5      934            958       -24     1225
# … with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
#   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
#   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#   minute <dbl>, time_hour <dttm>
# Order flights from most departure delay to least
arrange(flights, desc(dep_delay))
# A tibble: 336,776 x 19
    year month   day dep_time sched_dep_time dep_delay arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>
 1  2013     1     9      641            900      1301     1242
 2  2013     6    15     1432           1935      1137     1607
 3  2013     1    10     1121           1635      1126     1239
 4  2013     9    20     1139           1845      1014     1457
 5  2013     7    22      845           1600      1005     1044
 6  2013     4    10     1100           1900       960     1342
 7  2013     3    17     2321            810       911      135
 8  2013     6    27      959           1900       899     1236
 9  2013     7    22     2257            759       898      121
10  2013    12     5      756           1700       896     1058
# … with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
#   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
#   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#   minute <dbl>, time_hour <dttm>

4.5 Summarize data frames with summarize()

summarize() summarizes a data frame into a single row.

  • First argument: tibble (tbl.df)
  • Subsequent arguments: expressions of the form col_name = summary_function()

Examples of summary functions:

  • mean(x): calculates the arithmetic mean of the values in x
  • n_distinct(x): counts the number of unique values in x

Most summary functions take an optional na.rm argument, which stands for “remove missing (NA) values”. The flights data frame contains many missing values, so all of the examples of summary functions in this guide will include the na.rm = TRUE argument.

4.5.1 Examples of summarize()

# Calculate the average departure delay
summarize(flights, avg_delay = mean(dep_delay, na.rm = TRUE))
# A tibble: 1 x 1
  avg_delay
      <dbl>
1      12.6
# Count the number unique departure times
summarize(flights, unique_dep_times = n_distinct(dep_time, na.rm = TRUE))
# A tibble: 1 x 1
  unique_dep_times
             <int>
1             1318

summarize() is more useful when it is combined with group_by(). To learn more, read Grouped summaries.

4.6 Group a data frame with group_by()

group_by() groups a data frame by row values.

  • First argument: tibble (tbl.df)
  • Subsequent argument: column names without quotation marks

group_by() does not change the row order of a data frame, so it may seem that calling group_by() has no effect. However, a grouped data frame’s meta data header displays the groups and their count.

The function ungroup(df) removes the groups of a grouped data frame df.

4.6.1 Examples of group_by() and ungroup()

# Group the data frame by month (12 groups, one for each month)
grouped_df <- group_by(flights, month)
grouped_df
# A tibble: 336,776 x 19
# Groups:   month [12]
    year month   day dep_time sched_dep_time dep_delay arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>
 1  2013     1     1      517            515         2      830
 2  2013     1     1      533            529         4      850
 3  2013     1     1      542            540         2      923
 4  2013     1     1      544            545        -1     1004
 5  2013     1     1      554            600        -6      812
 6  2013     1     1      554            558        -4      740
 7  2013     1     1      555            600        -5      913
 8  2013     1     1      557            600        -3      709
 9  2013     1     1      557            600        -3      838
10  2013     1     1      558            600        -2      753
# … with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
#   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
#   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#   minute <dbl>, time_hour <dttm>
# Ungroup a grouped data frame
ungroup(grouped_df)
# A tibble: 336,776 x 19
    year month   day dep_time sched_dep_time dep_delay arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>
 1  2013     1     1      517            515         2      830
 2  2013     1     1      533            529         4      850
 3  2013     1     1      542            540         2      923
 4  2013     1     1      544            545        -1     1004
 5  2013     1     1      554            600        -6      812
 6  2013     1     1      554            558        -4      740
 7  2013     1     1      555            600        -5      913
 8  2013     1     1      557            600        -3      709
 9  2013     1     1      557            600        -3      838
10  2013     1     1      558            600        -2      753
# … with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
#   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
#   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#   minute <dbl>, time_hour <dttm>

 

5 The pipe operator %>%

The pipe operator %>%, or “the pipe”, is a part of the magrittr package, a sub-package within dplyr. It takes whatever is on its left and uses it as the first argument of the function on its right.

Keyboard shortcut: CMD + SHIFT + M (Mac) or CTL + SHIFT + M (Windows)

5.1 Without the pipe

dplyr functions do not modify the data frame they take as input, and return a new data frame instead. In traditional R, this means that using multiple dplyr functions to transform a single data frame involves intermediate data frames.

The output of each dplyr function must be saved as a new, intermediate, data frame that the next function takes as input.

5.1.1 Example of intermediate data frames

Rank total flight delay from greatest to least, but only for flights that departed from JFK. Keep only the carrier, dest, distance, air_time, dep_delay, arr_delay and tot_delay columns.

# 1. Pick out flights that departed from JFK
step1 <- filter(flights, origin == "JFK")

# 2. Create a column for total delay times
step2 <- mutate(step1, tot_delay = dep_delay + arr_delay)

# 3. Order flights from most total delay to least
step3 <- arrange(step2, desc(tot_delay))

# 4. Keep only the desired columns
step4 <- select(step3, carrier, dest, distance, air_time, dep_delay, arr_delay, tot_delay)

# View transformed data frame
step4
# A tibble: 111,279 x 7
   carrier dest  distance air_time dep_delay arr_delay tot_delay
   <chr>   <chr>    <dbl>    <dbl>     <dbl>     <dbl>     <dbl>
 1 HA      HNL       4983      640      1301      1272      2573
 2 MQ      CMH        483       74      1137      1127      2264
 3 AA      SFO       2586      354      1014      1007      2021
 4 MQ      CVG        589       96      1005       989      1994
 5 DL      TPA       1005      139       960       931      1891
 6 DL      PDX       2454      313       899       850      1749
 7 AA      LAS       2248      323       853       852      1705
 8 MQ      BWI        184       41       853       851      1704
 9 DL      TPA       1005      173       825       856      1681
10 DL      LAX       2475      335       800       784      1584
# … with 111,269 more rows

The number of intermediate data frames quickly increases as the desired manipulation becomes more complex, or as more data frames are involved. Naming these intermediate data frames and keeping them organized can slow down data manipulation, and greatly increases the length of your code.

5.2 With the pipe

The pipe takes whatever is on its left and uses it as the first argument of the function on its right. Since dplyr functions always return a data frame and take a data frame as its first argument, the pipe can pass a data frame from function to function without using intermediate data frames.

5.2.1 Example of %>%

# This code is equivalent to the code in section 5.1.1
just_1_step <- flights %>%
  filter(origin == "JFK") %>% 
  mutate(tot_delay = dep_delay + arr_delay) %>% 
  arrange(desc(tot_delay)) %>% 
  select(carrier, dest, distance, air_time, dep_delay, arr_delay, tot_delay)

just_1_step
# A tibble: 111,279 x 7
   carrier dest  distance air_time dep_delay arr_delay tot_delay
   <chr>   <chr>    <dbl>    <dbl>     <dbl>     <dbl>     <dbl>
 1 HA      HNL       4983      640      1301      1272      2573
 2 MQ      CMH        483       74      1137      1127      2264
 3 AA      SFO       2586      354      1014      1007      2021
 4 MQ      CVG        589       96      1005       989      1994
 5 DL      TPA       1005      139       960       931      1891
 6 DL      PDX       2454      313       899       850      1749
 7 AA      LAS       2248      323       853       852      1705
 8 MQ      BWI        184       41       853       851      1704
 9 DL      TPA       1005      173       825       856      1681
10 DL      LAX       2475      335       800       784      1584
# … with 111,269 more rows

5.3 Exercise

  1. Use %>% and dplyr functions on flights to do the following:
  • Keep only the flights with a departure delay of more than 2 hours, and only the columns origin, dest, and dep_delay
  • Create a new column called dep_delay2 whose value is the string "x hrs and y min", where x is the number of hours and y is the number of minutes of departure delay.
    • If the value of dep_delay is 134, the value of dep_delay2 should be "2 hrs and 14 min".
    • Hint: use as.character() and str_c()
      Click for Answer
      flights %>% 
        filter(dep_delay > 120) %>% 
        select(origin, dest, dep_delay, dep_delay) %>% 
        mutate(dep_delay2 = str_c(as.character(dep_delay %/% 60), " hrs and ", as.character(dep_delay%%60), " min"))  
      # A tibble: 9,723 x 4
         origin dest  dep_delay dep_delay2       
         <chr>  <chr>     <dbl> <chr>            
       1 JFK    BWI         853 14 hrs and 13 min
       2 EWR    BOS         144 2 hrs and 24 min 
       3 LGA    IAH         134 2 hrs and 14 min 
       4 JFK    SJU         122 2 hrs and 2 min  
       5 EWR    OMA         290 4 hrs and 50 min 
       6 EWR    BTV         260 4 hrs and 20 min 
       7 JFK    LAX         131 2 hrs and 11 min 
       8 JFK    BNA         129 2 hrs and 9 min  
       9 EWR    RIC         155 2 hrs and 35 min 
      10 JFK    DCA         157 2 hrs and 37 min 
      # … with 9,713 more rows

 

6 Grouped dplyr functions

The pipe makes it easy to apply multiple functions to a single data frame. It is particularly useful to create a grouped data frame wih group_by(), and then apply some other dplyr function to every group.

6.1 Grouped summaries

When summarize() takes a grouped data frame as its first argument, the summary function arguments are applied to each group separately. The output data frame has as many rows as there are groups in the input data frame.

6.1.1 Examples of grouped summaries

# Calculate the average departure delay, by month
flights %>% 
  group_by(month) %>% 
  summarize(avg_delay = mean(dep_delay, na.rm = TRUE))
# A tibble: 12 x 2
   month avg_delay
   <int>     <dbl>
 1     1     10.0 
 2     2     10.8 
 3     3     13.2 
 4     4     13.9 
 5     5     13.0 
 6     6     20.8 
 7     7     21.7 
 8     8     12.6 
 9     9      6.72
10    10      6.24
11    11      5.44
12    12     16.6 
# Validate the grouped summary by calculating just the average departure delay for June
flights %>% 
  filter(month == 6) %>% 
  summarize(avg_delay = mean(dep_delay, na.rm = TRUE))
# A tibble: 1 x 1
  avg_delay
      <dbl>
1      20.8
# Calculate the total departure delay by origin and destination, then order rows from most total delay to least
flights %>% 
  group_by(origin, dest) %>% 
  summarize(dep_delay = sum(dep_delay, na.rm = TRUE)) %>% 
  arrange(desc(dep_delay))
# A tibble: 224 x 3
# Groups:   origin [3]
   origin dest  dep_delay
   <chr>  <chr>     <dbl>
 1 LGA    ATL      115425
 2 JFK    SFO       97271
 3 JFK    LAX       95418
 4 LGA    ORD       91812
 5 EWR    ORD       85683
 6 EWR    ATL       75819
 7 EWR    SFO       72950
 8 JFK    BOS       67667
 9 EWR    BOS       65877
10 JFK    MCO       57609
# … with 214 more rows

6.1.2 Grouped summary exercises

  1. Use group_by() and summarize() to find the carrier with the best on-time average (lowest average arrival delay).
    Click for Answer
     flights %>% 
       group_by(carrier) %>% 
       summarize(avg_delay = mean(arr_delay, na.rm = TRUE)) %>% 
       arrange(avg_delay)
     # A tibble: 16 x 2
        carrier avg_delay
        <chr>       <dbl>
      1 AS         -9.93 
      2 HA         -6.92 
      3 AA          0.364
      4 DL          1.64 
      5 VX          1.76 
      6 US          2.13 
      7 UA          3.56 
      8 9E          7.38 
      9 B6          9.46 
     10 WN          9.65 
     11 MQ         10.8  
     12 OO         11.9  
     13 YV         15.6  
     14 EV         15.8  
     15 FL         20.1  
     16 F9         21.9  

6.2 Grouped filters and mutates

Grouping is most useful in combination with summarize(), but grouped filters and grouped mutates are also possible.

6.2.1 Examples of grouped filters/mutates

# `n()` is a helper function that returns the size of a group.
# Pick out the 72 destinations that were flown to more than 500 times 
flights %>% 
  group_by(dest) %>% 
  filter(n() > 100)
# A tibble: 336,518 x 19
# Groups:   dest [93]
    year month   day dep_time sched_dep_time dep_delay arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>
 1  2013     1     1      517            515         2      830
 2  2013     1     1      533            529         4      850
 3  2013     1     1      542            540         2      923
 4  2013     1     1      544            545        -1     1004
 5  2013     1     1      554            600        -6      812
 6  2013     1     1      554            558        -4      740
 7  2013     1     1      555            600        -5      913
 8  2013     1     1      557            600        -3      709
 9  2013     1     1      557            600        -3      838
10  2013     1     1      558            600        -2      753
# … with 336,508 more rows, and 12 more variables: sched_arr_time <int>,
#   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
#   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#   minute <dbl>, time_hour <dttm>
# Calculate the proportion of a flight's air time to the total airtime of all flights to the same destination
flights %>% 
  select(month, day, origin, dest, dep_time, air_time) %>% 
  group_by(dest) %>% 
  mutate(tot_airtime = sum(air_time, na.rm = TRUE),
         prop_of_tot_airtime = air_time / tot_airtime)
# A tibble: 336,776 x 8
# Groups:   dest [105]
   month   day origin dest  dep_time air_time tot_airtime prop_of_tot_airt…
   <int> <int> <chr>  <chr>    <int>    <dbl>       <dbl>             <dbl>
 1     1     1 EWR    IAH        517      227     1403479         0.000162 
 2     1     1 LGA    IAH        533      227     1403479         0.000162 
 3     1     1 JFK    MIA        542      160     1775327         0.0000901
 4     1     1 JFK    BQN        544      183      173056         0.00106  
 5     1     1 LGA    ATL        554      116     1901410         0.0000610
 6     1     1 EWR    ORD        554      150     1914833         0.0000783
 7     1     1 EWR    FLL        555      158     1811271         0.0000872
 8     1     1 LGA    IAD        557       53      257457         0.000206 
 9     1     1 JFK    MCO        557      140     1882381         0.0000744
10     1     1 LGA    ORD        558      138     1914833         0.0000721
# … with 336,766 more rows