Goal: Learn how to manipulate tibbles, the tidyverse version of data frames, using the dplyr package
The tidyverse is a collection of R packages designed for data science. The tidyverse packages use the same grammar and data structures, making it easy to combine functions from different packages.
The core tidyverse packages:
This quickstart guide focuses on the tibble and dplyr packages, and covers the essential functions you need to get started with data manipulation. It is based on Chapter 5, ‘Data transformation’, of R for Data Science, by Garrett Grolemund and Hadley Wickham.
There are two types of code blocks in this guide.
# This is an example of R code.
1 + 2
# This is an example of output.
3
The first type of block, with a gray background, is R code. You can copy and run such code blocks in the console of your R editor. You can also download the quickstart-companion R script, which contains all of the R code in this guide.
The second type of block, with a white background, is the output of running whatever R code is above it.
Install and load the tidyverse:
# Install the tidyverse
install.packages("tidyverse")
# Load the tidyverse
library(tidyverse)
Install and load the nycflights13 package, which provides data on flights that departed from New York City in 2013. We will use the nycflights13::flights data frame to explore the tibble and dplyr packages.
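The step above can be sketched as follows. This is one possible way to do it, not necessarily the guide's original code: the requireNamespace() guard is an addition that skips installation when the package is already present.

```r
# Install nycflights13 only if it is not already installed
# (the guard is an assumption added here, not part of the original guide)
if (!requireNamespace("nycflights13", quietly = TRUE)) {
  install.packages("nycflights13")
}

# Load the package so that flights can be used without the nycflights13:: prefix
library(nycflights13)

# Confirm that the data frame is available
is.data.frame(flights)
```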
The tbl_df class, or “tibble”, is the most important part of the tibble package. Tibbles are the tidyverse version of the base R data.frame class, and tidyverse functions are designed to work with tibbles.
A tibble is stricter than a data.frame. For example, subsetting a tibble with [ always returns another tibble. In comparison, subsetting a data.frame with [ sometimes returns a data.frame and sometimes returns a vector.
This guide uses the terms “tibble” and “data frame” interchangeably.
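The stricter subsetting behavior described above can be seen in a small sketch. The df and tb objects and their values are hypothetical and used only for illustration; they are not part of the flights data.

```r
library(tibble)

# A small illustrative data set (hypothetical values)
df <- data.frame(x = 1:3, y = c("a", "b", "c"))
tb <- as_tibble(df)

# Selecting a single column with [ drops a data.frame down to a vector...
class(df[, "x"])   # "integer"

# ...while the same operation on a tibble still returns a tibble
class(tb[, "x"])   # "tbl_df" "tbl" "data.frame"

# To extract a column from a tibble as a vector, use [[ or $
class(tb[["x"]])   # "integer"
```

Because the result type does not depend on how many columns are selected, code that subsets a tibble with [ can safely assume it gets a tibble back.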
Learn more about tibbles by running vignette("tibble") in your console, or by reading the CRAN documentation on tibbles.
Print the nycflights13::flights, or simply flights, data frame:
# A tibble: 336,776 x 19
    year month   day dep_time sched_dep_time dep_delay arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>
 1  2013     1     1      517            515         2      830
 2  2013     1     1      533            529         4      850
 3  2013     1     1      542            540         2      923
 4  2013     1     1      544            545        -1     1004
 5  2013     1     1      554            600        -6      812
 6  2013     1     1      554            558        -4      740
 7  2013     1     1      555            600        -5      913
 8  2013     1     1      557            600        -3      709
 9  2013     1     1      557            600        -3      838
10  2013     1     1      558            600        -2      753
# … with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
#   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
#   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#   minute <dbl>, time_hour <dttm>
When a tibble is printed, it displays the first 10 rows and as many columns as will fit on the console. At the top, it shows the total number of rows and columns, and provides the type of each column (also called a “variable”). Tibbles use different font styles and colors to style columns and make certain values more visually distinct. This formatting is not reproduced in the output code block above.
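If the default view hides rows or columns you want to see, the printing behavior can be adjusted. The sketch below goes slightly beyond the original guide: it assumes the tibble and nycflights13 packages are installed, and uses the n and width arguments of the tibble print method together with glimpse().

```r
library(tibble)
library(nycflights13)

# Show 5 rows instead of the default 10
print(flights, n = 5)

# Show all columns, even if they do not fit the console width
print(flights, width = Inf)

# Transposed overview: one line per column, with its type and first values
glimpse(flights)

# Number of rows and columns, matching the header of the printed tibble
dim(flights)   # 336776 rows, 19 columns
```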
If you print flights in an R editor like RStudio, the tibble may look like Figure 1 below, where negative values (as in the dep_delay column) are printed in red and metadata is printed in gray.