Goal: Learn how to manipulate tibbles, the tidyverse version of data frames, using the dplyr package
Prerequisites: You should be familiar with these aspects of R: console, packages, data frames, and vectors. It is helpful, but not necessary, to have previous programming experience.
The tidyverse is a collection of R packages designed for data science. The tidyverse packages use the same grammar and data structures, making it easy to combine functions from different packages.
The core tidyverse packages:
This quickstart guide focuses on the tibble and dplyr packages, and provides the essential functions to get started on data manipulation. It is based off Chapter 5 ‘Data transformation’ of R for Data Science, by Garrett Grolemund and Hadley Wickham.
There are two types of code blocks in this guide.
# This is an example of R code.
1 + 2
# This is an example of output.
3
The first type of block, with a gray background, is R code. You can copy and run such code blocks in the console of your R editor. You can also download the quickstart-companion R script, which contains all of the R code in this guide.
The second type of block, with a white background, is the output of running whatever R code is above it.
Install and load the tidyverse:
# Install the tidyverse
install.packages("tidyverse")
# Load the tidyverse
library(tidyverse)
Install and load the nycflights13 package, which is a data set that contains information on flights that departed from New York City in 2013. We will use the nycflights13::flights
data frame to explore the tibble and dplyr packages.
install.packages("nycflights13")
library(nycflights13)
The tbl_df
class, or “tibble”, is the most important part of the tibble package. Tibbles are the tidyverse version of the base R data.frame
class, and tidyverse functions are designed to work with tibbles.
A tibble is stricter than a data.frame
. For example, subsetting a tibble always returns another tibble. In comparison, subsetting a data.frame
sometimes returns a data.frame
and sometimes returns a vector
.
This guide uses the terms “tibble” and “data frame” interchangeably.
Learn more about tibbles by running
vignette("tibble")
in your console, or by reading the CRAN documentation on tibbles.
Print the nycflights13::flights
, or simply flights
, data frame:
flights
# A tibble: 336,776 x 19
year month day dep_time sched_dep_time dep_delay arr_time
<int> <int> <int> <int> <int> <dbl> <int>
1 2013 1 1 517 515 2 830
2 2013 1 1 533 529 4 850
3 2013 1 1 542 540 2 923
4 2013 1 1 544 545 -1 1004
5 2013 1 1 554 600 -6 812
6 2013 1 1 554 558 -4 740
7 2013 1 1 555 600 -5 913
8 2013 1 1 557 600 -3 709
9 2013 1 1 557 600 -3 838
10 2013 1 1 558 600 -2 753
# … with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
# arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
# origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
# minute <dbl>, time_hour <dttm>
When a tibble is printed, it displays the first 10 rows and as many columns as will fit on the console. At the top, it shows the total number of rows and columns, and provides the type of each column (also called a “variable”). Tibbles use different font styles and colors to style columns and make certain values more visually distinct. This formatting is not reproduced in the output code block above.
If you print flights
in an R editor like R Studio, the tibble may look like Figure 1 below, where negative values (as in the dep_delay
column) are printed in red, and meta data is printed in gray.
flights
printed in R Studio
All examples in this guide use the
flights
data frame. Run?flights
in your console to read its documentation.
dplyr is a collection of functions for manipulating data frames.
Important dplyr functions:
select()
: subsets a data frame by picking columnsfilter()
: subsets a data frame by picking rowsmutate()
: creates new columns or modifies existing columnsarrange()
: reorder a data frame by row valuessummarize()
: summarize a data framegroup_by()
: groups a data frame by row vauesThese functions have the same structure:
Since dplyr functions return a new data frame, and do not overwrite the original data frame, use the assignment operator <-
to save the output of a dplyr function:
flights
flights_modified <- some_dplyr_function(flights, argument2, argument3)
select()
select()
subsets a data frame by columns.
tbl.df
)Columns that are explicitly named are kept, and all other columns are dropped.
select()
# Select columns by name
select(flights, year, month, day, dep_delay, arr_delay)
# A tibble: 336,776 x 5
year month day dep_delay arr_delay
<int> <int> <int> <dbl> <dbl>
1 2013 1 1 2 11
2 2013 1 1 4 20
3 2013 1 1 2 33
4 2013 1 1 -1 -18
5 2013 1 1 -6 -25
6 2013 1 1 -4 12
7 2013 1 1 -5 19
8 2013 1 1 -3 -14
9 2013 1 1 -3 -8
10 2013 1 1 -2 8
# … with 336,766 more rows
# Select columns from `year` to `arr_time` (inclusive)
select(flights, year:arr_time)
# A tibble: 336,776 x 7
year month day dep_time sched_dep_time dep_delay arr_time
<int> <int> <int> <int> <int> <dbl> <int>
1 2013 1 1 517 515 2 830
2 2013 1 1 533 529 4 850
3 2013 1 1 542 540 2 923
4 2013 1 1 544 545 -1 1004
5 2013 1 1 554 600 -6 812
6 2013 1 1 554 558 -4 740
7 2013 1 1 555 600 -5 913
8 2013 1 1 557 600 -3 709
9 2013 1 1 557 600 -3 838
10 2013 1 1 558 600 -2 753
# … with 336,766 more rows
# Select all columns except the columns from `year` to `arr_time` (inclusive)
select(flights, -(year:arr_time))
# A tibble: 336,776 x 12
sched_arr_time arr_delay carrier flight tailnum origin dest air_time
<int> <dbl> <chr> <int> <chr> <chr> <chr> <dbl>
1 819 11 UA 1545 N14228 EWR IAH 227
2 830 20 UA 1714 N24211 LGA IAH 227
3 850 33 AA 1141 N619AA JFK MIA 160
4 1022 -18 B6 725 N804JB JFK BQN 183
5 837 -25 DL 461 N668DN LGA ATL 116
6 728 12 UA 1696 N39463 EWR ORD 150
7 854 19 B6 507 N516JB EWR FLL 158
8 723 -14 EV 5708 N829AS LGA IAD 53
9 846 -8 B6 79 N593JB JFK MCO 140
10 745 8 AA 301 N3ALAA LGA ORD 138
# … with 336,766 more rows, and 4 more variables: distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
select()
helper functionsselect()
can take helper functions as arguments.
Examples of helper functions:
everything()
: selects all columns that have not already been selectedcontains("abc")
: selects columns with a name that contains “abc”select()
helper functions# Rearrange data frame by moving certain columns to the front and moving all other columns to the back
select(flights, dep_delay, arr_delay, everything())
# A tibble: 336,776 x 19
dep_delay arr_delay year month day dep_time sched_dep_time arr_time
<dbl> <dbl> <int> <int> <int> <int> <int> <int>
1 2 11 2013 1 1 517 515 830
2 4 20 2013 1 1 533 529 850
3 2 33 2013 1 1 542 540 923
4 -1 -18 2013 1 1 544 545 1004
5 -6 -25 2013 1 1 554 600 812
6 -4 12 2013 1 1 554 558 740
7 -5 19 2013 1 1 555 600 913
8 -3 -14 2013 1 1 557 600 709
9 -3 -8 2013 1 1 557 600 838
10 -2 8 2013 1 1 558 600 753
# … with 336,766 more rows, and 11 more variables: sched_arr_time <int>,
# carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
# air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
# time_hour <dttm>
# Select all columns with a name that contains the string "time"
select(flights, contains("time"))
# A tibble: 336,776 x 6
dep_time sched_dep_time arr_time sched_arr_time air_time
<int> <int> <int> <int> <dbl>
1 517 515 830 819 227
2 533 529 850 830 227
3 542 540 923 850 160
4 544 545 1004 1022 183
5 554 600 812 837 116
6 554 558 740 728 150
7 555 600 913 854 158
8 557 600 709 723 53
9 557 600 838 846 140
10 558 600 753 745 138
# … with 336,766 more rows, and 1 more variable: time_hour <dttm>
select()
on flights
to move the year
column to the back.
select(flights, -year, year)
# A tibble: 336,776 x 19
month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <dbl> <int> <int>
1 1 1 517 515 2 830 819
2 1 1 533 529 4 850 830
3 1 1 542 540 2 923 850
4 1 1 544 545 -1 1004 1022
5 1 1 554 600 -6 812 837
6 1 1 554 558 -4 740 728
7 1 1 555 600 -5 913 854
8 1 1 557 600 -3 709 723
9 1 1 557 600 -3 838 846
10 1 1 558 600 -2 753 745
# … with 336,766 more rows, and 12 more variables: arr_delay <dbl>,
# carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
# air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
# time_hour <dttm>, year <int>
select()
on flights
to select the first three columns, and any column that contains the string "arr"
.
select(flights, year:day, contains("arr"))
# A tibble: 336,776 x 7
year month day arr_time sched_arr_time arr_delay carrier
<int> <int> <int> <int> <int> <dbl> <chr>
1 2013 1 1 830 819 11 UA
2 2013 1 1 850 830 20 UA
3 2013 1 1 923 850 33 AA
4 2013 1 1 1004 1022 -18 B6
5 2013 1 1 812 837 -25 DL
6 2013 1 1 740 728 12 UA
7 2013 1 1 913 854 19 B6
8 2013 1 1 709 723 -14 EV
9 2013 1 1 838 846 -8 B6
10 2013 1 1 753 745 8 AA
# … with 336,766 more rows
filter()
filter()
subsets a data frame by rows.
tbl.df
)The comparison operators >
, >=
, <
, <=
, !=
, and ==
create predicates, and the logical operators &
(AND), |
(OR), and !
(NOT) join predicates.
Rows where the predicate evaluates to TRUE
are kept, and all other rows are dropped.
filter()
# Pick out flights that departed on January 1st
filter(flights, month == 1 & day == 1)
# A tibble: 842 x 19
year month day dep_time sched_dep_time dep_delay arr_time
<int> <int> <int> <int> <int> <dbl> <int>
1 2013 1 1 517 515 2 830
2 2013 1 1 533 529 4 850
3 2013 1 1 542 540 2 923
4 2013 1 1 544 545 -1 1004
5 2013 1 1 554 600 -6 812
6 2013 1 1 554 558 -4 740
7 2013 1 1 555 600 -5 913
8 2013 1 1 557 600 -3 709
9 2013 1 1 557 600 -3 838
10 2013 1 1 558 600 -2 753
# … with 832 more rows, and 12 more variables: sched_arr_time <int>,
# arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
# origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
# minute <dbl>, time_hour <dttm>
# Pick out flights that departed in January, but not flights that departed on January 1st
filter(flights, month == 1 & day != 1)
# A tibble: 26,162 x 19
year month day dep_time sched_dep_time dep_delay arr_time
<int> <int> <int> <int> <int> <dbl> <int>
1 2013 1 2 42 2359 43 518
2 2013 1 2 126 2250 156 233
3 2013 1 2 458 500 -2 703
4 2013 1 2 512 515 -3 809
5 2013 1 2 535 540 -5 831
6 2013 1 2 536 529 7 840
7 2013 1 2 539 545 -6 959
8 2013 1 2 554 600 -6 845
9 2013 1 2 554 600 -6 841
10 2013 1 2 554 600 -6 909
# … with 26,152 more rows, and 12 more variables: sched_arr_time <int>,
# arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
# origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
# minute <dbl>, time_hour <dttm>
# Pick out flights that departed in January or in December
filter(flights, month == 1 | month == 12)
# A tibble: 55,139 x 19
year month day dep_time sched_dep_time dep_delay arr_time
<int> <int> <int> <int> <int> <dbl> <int>
1 2013 1 1 517 515 2 830
2 2013 1 1 533 529 4 850
3 2013 1 1 542 540 2 923
4 2013 1 1 544 545 -1 1004
5 2013 1 1 554 600 -6 812
6 2013 1 1 554 558 -4 740
7 2013 1 1 555 600 -5 913
8 2013 1 1 557 600 -3 709
9 2013 1 1 557 600 -3 838
10 2013 1 1 558 600 -2 753
# … with 55,129 more rows, and 12 more variables: sched_arr_time <int>,
# arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
# origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
# minute <dbl>, time_hour <dttm>
When writing a predicate, pay attention to operator precedence. In particular, the comparison operators have higher precedence than the logical operators.
For the following exercises, create a new data frame called flights2
.
flights2 <- select(flights, tailnum, starts_with("dep_"), starts_with("arr_"), distance)
filter()
on flights2
to find all flights that traveled 2000 or more miles.
filter(flights2, distance >= 2000)
# A tibble: 51,695 x 6
tailnum dep_time dep_delay arr_time arr_delay distance
<chr> <int> <dbl> <int> <dbl> <dbl>
1 N29129 558 -2 924 7 2475
2 N53441 558 -2 923 -14 2565
3 N76515 559 -1 854 -8 2227
4 N532UA 611 11 945 14 2586
5 N807AW 622 -8 1017 3 2133
6 N535UW 627 -3 1018 0 2153
7 N33289 628 -2 1016 29 2454
8 N38727 646 1 1023 -7 2434
9 N558JB 651 -4 936 -6 2248
10 N705TW 655 -5 1037 -8 2586
# … with 51,685 more rows
filter()
on flights2
to find all flights that departed early or more than 1 hour late.
filter(flights2, dep_delay < 0 | dep_delay > 60)
# A tibble: 210,156 x 6
tailnum dep_time dep_delay arr_time arr_delay distance
<chr> <int> <dbl> <int> <dbl> <dbl>
1 N804JB 544 -1 1004 -18 1576
2 N668DN 554 -6 812 -25 762
3 N39463 554 -4 740 12 719
4 N516JB 555 -5 913 19 1065
5 N829AS 557 -3 709 -14 229
6 N593JB 557 -3 838 -8 944
7 N3ALAA 558 -2 753 8 733
8 N793JB 558 -2 849 -2 1028
9 N657JB 558 -2 853 -3 1005
10 N29129 558 -2 924 7 2475
# … with 210,146 more rows
mutate()
mutate()
creates new columns or modifies existing columns.
tbl.df
)col_name = col_value
.If a column with the name col_name
does not exist, then a new column is added to the end of the data frame. Its name is col_name
, and its values are determined by col_value
.
If a column with the name col_name
already exists, then no new column is added and the col_name
column keeps its original postion. All of the values in col_name
are modified according to col_value
.
mutate()
mutate()
adds new columns to the end of a data frame, so newly created columns may not be displayed in the console. Create a narrower data frame called flights_narrow
by using select()
:
flights_narrow <- select(flights, year:day, distance, air_time, contains("delay"))
The following examples use the flights_narrow
data frame.
# Create a column called `tot_delay`. Each value is a sum of the`dep_delay` and `arr_delay` values in the same row.
mutate(flights_narrow, tot_delay = dep_delay + arr_delay)
# A tibble: 336,776 x 8
year month day distance air_time dep_delay arr_delay tot_delay
<int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2013 1 1 1400 227 2 11 13
2 2013 1 1 1416 227 4 20 24
3 2013 1 1 1089 160 2 33 35
4 2013 1 1 1576 183 -1 -18 -19
5 2013 1 1 762 116 -6 -25 -31
6 2013 1 1 719 150 -4 12 8
7 2013 1 1 1065 158 -5 19 14
8 2013 1 1 229 53 -3 -14 -17
9 2013 1 1 944 140 -3 -8 -11
10 2013 1 1 733 138 -2 8 6
# … with 336,766 more rows
# Create multiple columns at once
mutate(flights_narrow, hours = air_time / 60, speed = distance / hours)
# A tibble: 336,776 x 9
year month day distance air_time dep_delay arr_delay hours speed
<int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2013 1 1 1400 227 2 11 3.78 370.
2 2013 1 1 1416 227 4 20 3.78 374.
3 2013 1 1 1089 160 2 33 2.67 408.
4 2013 1 1 1576 183 -1 -18 3.05 517.
5 2013 1 1 762 116 -6 -25 1.93 394.
6 2013 1 1 719 150 -4 12 2.5 288.
7 2013 1 1 1065 158 -5 19 2.63 404.
8 2013 1 1 229 53 -3 -14 0.883 259.
9 2013 1 1 944 140 -3 -8 2.33 405.
10 2013 1 1 733 138 -2 8 2.3 319.
# … with 336,766 more rows
# Modify an existing column
mutate(flights_narrow, distance = 10 - 2)
# A tibble: 336,776 x 7
year month day distance air_time dep_delay arr_delay
<int> <int> <int> <dbl> <dbl> <dbl> <dbl>
1 2013 1 1 8 227 2 11
2 2013 1 1 8 227 4 20
3 2013 1 1 8 160 2 33
4 2013 1 1 8 183 -1 -18
5 2013 1 1 8 116 -6 -25
6 2013 1 1 8 150 -4 12
7 2013 1 1 8 158 -5 19
8 2013 1 1 8 53 -3 -14
9 2013 1 1 8 140 -3 -8
10 2013 1 1 8 138 -2 8
# … with 336,766 more rows
In the examples above, mutate()
used the arithmetic operators (+
, -
, *
, /
) to create new columns. mutate()
can use any vectorized operator or function to create a new column. A vectorized operator/function takes a vector of values as input and returns as ouput a vector of the same size.
Examples of vectorized operators/functions :
%/%
: integer division operator%/%
: modulo operatorcumsum(x)
: calculates the cumulative sum of the elements in x
# Create columns that separate `air_time` into hours and minutes
mutate(flights_narrow, at_hr = air_time %/% 60, at_min = air_time %% 60)
# A tibble: 336,776 x 9
year month day distance air_time dep_delay arr_delay at_hr at_min
<int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2013 1 1 1400 227 2 11 3 47
2 2013 1 1 1416 227 4 20 3 47
3 2013 1 1 1089 160 2 33 2 40
4 2013 1 1 1576 183 -1 -18 3 3
5 2013 1 1 762 116 -6 -25 1 56
6 2013 1 1 719 150 -4 12 2 30
7 2013 1 1 1065 158 -5 19 2 38
8 2013 1 1 229 53 -3 -14 0 53
9 2013 1 1 944 140 -3 -8 2 20
10 2013 1 1 733 138 -2 8 2 18
# … with 336,766 more rows
# Create a column that cumulatively sums the arrival delays
mutate(flights_narrow, cum_arr_delay = cumsum(arr_delay))
# A tibble: 336,776 x 8
year month day distance air_time dep_delay arr_delay cum_arr_delay
<int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2013 1 1 1400 227 2 11 11
2 2013 1 1 1416 227 4 20 31
3 2013 1 1 1089 160 2 33 64
4 2013 1 1 1576 183 -1 -18 46
5 2013 1 1 762 116 -6 -25 21
6 2013 1 1 719 150 -4 12 33
7 2013 1 1 1065 158 -5 19 52
8 2013 1 1 229 53 -3 -14 38
9 2013 1 1 944 140 -3 -8 30
10 2013 1 1 733 138 -2 8 38
# … with 336,766 more rows
select()
on flights
to create a new data frame called flights_c
that contains only character type columns (Hint: use glimpse(flights)
). Then, use mutate()
on flights_c
to create a column called trip
. If a plane flies from EWR to IAH, the value of trip
should be "EWR to IAH"
.
flights_c <- select(flights, carrier, tailnum, origin, dest)
mutate(flights_c, trip = str_c(origin, " to ", dest))
# A tibble: 336,776 x 5
carrier tailnum origin dest trip
<chr> <chr> <chr> <chr> <chr>
1 UA N14228 EWR IAH EWR to IAH
2 UA N24211 LGA IAH LGA to IAH
3 AA N619AA JFK MIA JFK to MIA
4 B6 N804JB JFK BQN JFK to BQN
5 DL N668DN LGA ATL LGA to ATL
6 UA N39463 EWR ORD EWR to ORD
7 B6 N516JB EWR FLL EWR to FLL
8 EV N829AS LGA IAD LGA to IAD
9 B6 N593JB JFK MCO JFK to MCO
10 AA N3ALAA LGA ORD LGA to ORD
# … with 336,766 more rows
# Alternatively:
# mutate(flights_c, trip = str_c(origin, dest, sep = " to "))
mutate()
on flights_narrow
to create a column called tot_delay
, which calculates the total delay of each flight. Then, create a column called prop_dep_delay
, which calculates the proportion of each flight’s departure delay to its total delay.
mutate(flights_narrow, tot_delay = dep_delay + arr_delay,
prop_dep_delay = dep_delay / tot_delay)
# A tibble: 336,776 x 9
year month day distance air_time dep_delay arr_delay tot_delay
<int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2013 1 1 1400 227 2 11 13
2 2013 1 1 1416 227 4 20 24
3 2013 1 1 1089 160 2 33 35
4 2013 1 1 1576 183 -1 -18 -19
5 2013 1 1 762 116 -6 -25 -31
6 2013 1 1 719 150 -4 12 8
7 2013 1 1 1065 158 -5 19 14
8 2013 1 1 229 53 -3 -14 -17
9 2013 1 1 944 140 -3 -8 -11
10 2013 1 1 733 138 -2 8 6
# … with 336,766 more rows, and 1 more variable: prop_dep_delay <dbl>
arrange()
arrange()
reorders the rows of a data frame.
tbl.df
)Values in an argument column are sorted in ascending order by default. Wrap a column name in desc()
to sort in descending order.
arrange()
# Order flights from least departure delay to most
arrange(flights, dep_delay)
# A tibble: 336,776 x 19
year month day dep_time sched_dep_time dep_delay arr_time
<int> <int> <int> <int> <int> <dbl> <int>
1 2013 12 7 2040 2123 -43 40
2 2013 2 3 2022 2055 -33 2240
3 2013 11 10 1408 1440 -32 1549
4 2013 1 11 1900 1930 -30 2233
5 2013 1 29 1703 1730 -27 1947
6 2013 8 9 729 755 -26 1002
7 2013 10 23 1907 1932 -25 2143
8 2013 3 30 2030 2055 -25 2213
9 2013 3 2 1431 1455 -24 1601
10 2013 5 5 934 958 -24 1225
# … with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
# arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
# origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
# minute <dbl>, time_hour <dttm>
# Order flights from most departure delay to least
arrange(flights, desc(dep_delay))
# A tibble: 336,776 x 19
year month day dep_time sched_dep_time dep_delay arr_time
<int> <int> <int> <int> <int> <dbl> <int>
1 2013 1 9 641 900 1301 1242
2 2013 6 15 1432 1935 1137 1607
3 2013 1 10 1121 1635 1126 1239
4 2013 9 20 1139 1845 1014 1457
5 2013 7 22 845 1600 1005 1044
6 2013 4 10 1100 1900 960 1342
7 2013 3 17 2321 810 911 135
8 2013 6 27 959 1900 899 1236
9 2013 7 22 2257 759 898 121
10 2013 12 5 756 1700 896 1058
# … with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
# arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
# origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
# minute <dbl>, time_hour <dttm>
summarize()
summarize()
summarizes a data frame into a single row.
tbl.df
)col_name = summary_function()
Examples of summary functions:
mean(x)
: calculates the arithmetic mean of the values in x
n_distinct(x)
: counts the number of unique values in x
Most summary functions take an optional na.rm
argument, which stands for “remove missing (NA
) values”. The flights
data frame contains many missing values, so all of the examples of summary functions in this guide will include the na.rm = TRUE
argument.
summarize()
# Calculate the average departure delay
summarize(flights, avg_delay = mean(dep_delay, na.rm = TRUE))
# A tibble: 1 x 1
avg_delay
<dbl>
1 12.6
# Count the number unique departure times
summarize(flights, unique_dep_times = n_distinct(dep_time, na.rm = TRUE))
# A tibble: 1 x 1
unique_dep_times
<int>
1 1318
summarize()
is more useful when it is combined withgroup_by()
. To learn more, read Grouped summaries.
group_by()
group_by()
groups a data frame by row values.
tbl.df
)group_by()
does not change the row order of a data frame, so it may seem that calling group_by()
has no effect. However, a grouped data frame’s meta data header displays the groups and their count.
The function ungroup(df)
removes the groups of a grouped data frame df
.
group_by()
and ungroup()
# Group the data frame by month (12 groups, one for each month)
grouped_df <- group_by(flights, month)
grouped_df
# A tibble: 336,776 x 19
# Groups: month [12]
year month day dep_time sched_dep_time dep_delay arr_time
<int> <int> <int> <int> <int> <dbl> <int>
1 2013 1 1 517 515 2 830
2 2013 1 1 533 529 4 850
3 2013 1 1 542 540 2 923
4 2013 1 1 544 545 -1 1004
5 2013 1 1 554 600 -6 812
6 2013 1 1 554 558 -4 740
7 2013 1 1 555 600 -5 913
8 2013 1 1 557 600 -3 709
9 2013 1 1 557 600 -3 838
10 2013 1 1 558 600 -2 753
# … with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
# arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
# origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
# minute <dbl>, time_hour <dttm>
# Ungroup a grouped data frame
ungroup(grouped_df)
# A tibble: 336,776 x 19
year month day dep_time sched_dep_time dep_delay arr_time
<int> <int> <int> <int> <int> <dbl> <int>
1 2013 1 1 517 515 2 830
2 2013 1 1 533 529 4 850
3 2013 1 1 542 540 2 923
4 2013 1 1 544 545 -1 1004
5 2013 1 1 554 600 -6 812
6 2013 1 1 554 558 -4 740
7 2013 1 1 555 600 -5 913
8 2013 1 1 557 600 -3 709
9 2013 1 1 557 600 -3 838
10 2013 1 1 558 600 -2 753
# … with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
# arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
# origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
# minute <dbl>, time_hour <dttm>
%>%
The pipe operator %>%
, or “the pipe”, is a part of the magrittr package, a sub-package within dplyr. It takes whatever is on its left and uses it as the first argument of the function on its right.
Keyboard shortcut:
CMD + SHIFT + M
(Mac) orCTL + SHIFT + M
(Windows)
dplyr functions do not modify the data frame they take as input, and return a new data frame instead. In traditional R, this means that using multiple dplyr functions to transform a single data frame involves intermediate data frames.
The output of each dplyr function must be saved as a new, intermediate, data frame that the next function takes as input.
Rank total flight delay from greatest to least, but only for flights that departed from JFK. Keep only the carrier
, dest
, distance
, air_time
, dep_delay
, arr_delay
and tot_delay
columns.
# 1. Pick out flights that departed from JFK
step1 <- filter(flights, origin == "JFK")
# 2. Create a column for total delay times
step2 <- mutate(step1, tot_delay = dep_delay + arr_delay)
# 3. Order flights from most total delay to least
step3 <- arrange(step2, desc(tot_delay))
# 4. Keep only the desired columns
step4 <- select(step3, carrier, dest, distance, air_time, dep_delay, arr_delay, tot_delay)
# View transformed data frame
step4
# A tibble: 111,279 x 7
carrier dest distance air_time dep_delay arr_delay tot_delay
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 HA HNL 4983 640 1301 1272 2573
2 MQ CMH 483 74 1137 1127 2264
3 AA SFO 2586 354 1014 1007 2021
4 MQ CVG 589 96 1005 989 1994
5 DL TPA 1005 139 960 931 1891
6 DL PDX 2454 313 899 850 1749
7 AA LAS 2248 323 853 852 1705
8 MQ BWI 184 41 853 851 1704
9 DL TPA 1005 173 825 856 1681
10 DL LAX 2475 335 800 784 1584
# … with 111,269 more rows
The number of intermediate data frames quickly increases as the desired manipulation becomes more complex, or as more data frames are involved. Naming these intermediate data frames and keeping them organized can slow down data manipulation, and greatly increases the length of your code.
The pipe takes whatever is on its left and uses it as the first argument of the function on its right. Since dplyr functions always return a data frame and take a data frame as its first argument, the pipe can pass a data frame from function to function without using intermediate data frames.
%>%
# This code is equivalent to the code in section 5.1.1
just_1_step <- flights %>%
filter(origin == "JFK") %>%
mutate(tot_delay = dep_delay + arr_delay) %>%
arrange(desc(tot_delay)) %>%
select(carrier, dest, distance, air_time, dep_delay, arr_delay, tot_delay)
just_1_step
# A tibble: 111,279 x 7
carrier dest distance air_time dep_delay arr_delay tot_delay
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 HA HNL 4983 640 1301 1272 2573
2 MQ CMH 483 74 1137 1127 2264
3 AA SFO 2586 354 1014 1007 2021
4 MQ CVG 589 96 1005 989 1994
5 DL TPA 1005 139 960 931 1891
6 DL PDX 2454 313 899 850 1749
7 AA LAS 2248 323 853 852 1705
8 MQ BWI 184 41 853 851 1704
9 DL TPA 1005 173 825 856 1681
10 DL LAX 2475 335 800 784 1584
# … with 111,269 more rows
%>%
and dplyr functions on flights
to do the following:origin
, dest
, and dep_delay
dep_delay2
whose value is the string "x hrs and y min"
, where x is the number of hours and y is the number of minutes of departure delay.
dep_delay
is 134
, the value of dep_delay2
should be "2 hrs and 14 min"
.as.character()
and str_c()
flights %>%
filter(dep_delay > 120) %>%
select(origin, dest, dep_delay, dep_delay) %>%
mutate(dep_delay2 = str_c(as.character(dep_delay %/% 60), " hrs and ", as.character(dep_delay%%60), " min"))
# A tibble: 9,723 x 4
origin dest dep_delay dep_delay2
<chr> <chr> <dbl> <chr>
1 JFK BWI 853 14 hrs and 13 min
2 EWR BOS 144 2 hrs and 24 min
3 LGA IAH 134 2 hrs and 14 min
4 JFK SJU 122 2 hrs and 2 min
5 EWR OMA 290 4 hrs and 50 min
6 EWR BTV 260 4 hrs and 20 min
7 JFK LAX 131 2 hrs and 11 min
8 JFK BNA 129 2 hrs and 9 min
9 EWR RIC 155 2 hrs and 35 min
10 JFK DCA 157 2 hrs and 37 min
# … with 9,713 more rows
The pipe makes it easy to apply multiple functions to a single data frame. It is particularly useful to create a grouped data frame wih group_by()
, and then apply some other dplyr function to every group.
When summarize()
takes a grouped data frame as its first argument, the summary function arguments are applied to each group separately. The output data frame has as many rows as there are groups in the input data frame.
# Calculate the average departure delay, by month
flights %>%
group_by(month) %>%
summarize(avg_delay = mean(dep_delay, na.rm = TRUE))
# A tibble: 12 x 2
month avg_delay
<int> <dbl>
1 1 10.0
2 2 10.8
3 3 13.2
4 4 13.9
5 5 13.0
6 6 20.8
7 7 21.7
8 8 12.6
9 9 6.72
10 10 6.24
11 11 5.44
12 12 16.6
# Validate the grouped summary by calculating just the average departure delay for June
flights %>%
filter(month == 6) %>%
summarize(avg_delay = mean(dep_delay, na.rm = TRUE))
# A tibble: 1 x 1
avg_delay
<dbl>
1 20.8
# Calculate the total departure delay by origin and destination, then order rows from most total delay to least
flights %>%
group_by(origin, dest) %>%
summarize(dep_delay = sum(dep_delay, na.rm = TRUE)) %>%
arrange(desc(dep_delay))
# A tibble: 224 x 3
# Groups: origin [3]
origin dest dep_delay
<chr> <chr> <dbl>
1 LGA ATL 115425
2 JFK SFO 97271
3 JFK LAX 95418
4 LGA ORD 91812
5 EWR ORD 85683
6 EWR ATL 75819
7 EWR SFO 72950
8 JFK BOS 67667
9 EWR BOS 65877
10 JFK MCO 57609
# … with 214 more rows
group_by()
and summarize()
to find the carrier with the best on-time average (lowest average arrival delay).
flights %>%
group_by(carrier) %>%
summarize(avg_delay = mean(arr_delay, na.rm = TRUE)) %>%
arrange(avg_delay)
# A tibble: 16 x 2
carrier avg_delay
<chr> <dbl>
1 AS -9.93
2 HA -6.92
3 AA 0.364
4 DL 1.64
5 VX 1.76
6 US 2.13
7 UA 3.56
8 9E 7.38
9 B6 9.46
10 WN 9.65
11 MQ 10.8
12 OO 11.9
13 YV 15.6
14 EV 15.8
15 FL 20.1
16 F9 21.9
Grouping is most useful in combination with summarize()
, but grouped filters and grouped mutates are also possible.
# `n()` is a helper function that returns the size of a group.
# Pick out the 72 destinations that were flown to more than 500 times
flights %>%
group_by(dest) %>%
filter(n() > 100)
# A tibble: 336,518 x 19
# Groups: dest [93]
year month day dep_time sched_dep_time dep_delay arr_time
<int> <int> <int> <int> <int> <dbl> <int>
1 2013 1 1 517 515 2 830
2 2013 1 1 533 529 4 850
3 2013 1 1 542 540 2 923
4 2013 1 1 544 545 -1 1004
5 2013 1 1 554 600 -6 812
6 2013 1 1 554 558 -4 740
7 2013 1 1 555 600 -5 913
8 2013 1 1 557 600 -3 709
9 2013 1 1 557 600 -3 838
10 2013 1 1 558 600 -2 753
# … with 336,508 more rows, and 12 more variables: sched_arr_time <int>,
# arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
# origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
# minute <dbl>, time_hour <dttm>
# Calculate the proportion of a flight's air time to the total airtime of all flights to the same destination
flights %>%
select(month, day, origin, dest, dep_time, air_time) %>%
group_by(dest) %>%
mutate(tot_airtime = sum(air_time, na.rm = TRUE),
prop_of_tot_airtime = air_time / tot_airtime)
# A tibble: 336,776 x 8
# Groups: dest [105]
month day origin dest dep_time air_time tot_airtime prop_of_tot_airt…
<int> <int> <chr> <chr> <int> <dbl> <dbl> <dbl>
1 1 1 EWR IAH 517 227 1403479 0.000162
2 1 1 LGA IAH 533 227 1403479 0.000162
3 1 1 JFK MIA 542 160 1775327 0.0000901
4 1 1 JFK BQN 544 183 173056 0.00106
5 1 1 LGA ATL 554 116 1901410 0.0000610
6 1 1 EWR ORD 554 150 1914833 0.0000783
7 1 1 EWR FLL 555 158 1811271 0.0000872
8 1 1 LGA IAD 557 53 257457 0.000206
9 1 1 JFK MCO 557 140 1882381 0.0000744
10 1 1 LGA ORD 558 138 1914833 0.0000721
# … with 336,766 more rows