Goal: Learn how to manipulate tibbles, the tidyverse version of data frames, using the dplyr package.

Prerequisites: You should be familiar with these aspects of R: console, packages, data frames, and vectors. It is helpful, but not necessary, to have previous programming experience.

1 Introduction

The tidyverse is a collection of R packages designed for data science. The tidyverse packages use the same grammar and data structures, making it easy to combine functions from different packages.

The core tidyverse packages:

  • tibble: for tibbles, the tidyverse version of data frames
  • dplyr: for data manipulation
  • readr: for data import
  • tidyr: for data tidying
  • stringr: for string manipulation
  • purrr: for functional programming
  • forcats: for factors
  • ggplot2: for data visualization

This quickstart guide focuses on the tibble and dplyr packages and covers the essential functions for getting started with data manipulation. It is based on Chapter 5, ‘Data transformation’, of R for Data Science, by Garrett Grolemund and Hadley Wickham.

 

2 Getting started

2.1 Running R code

There are two types of code blocks in this guide.

# This is an example of R code.
1 + 2

# This is an example of output.
3

The first type of block, with a gray background, is R code. You can copy and run such code blocks in the console of your R editor. You can also download the quickstart-companion R script, which contains all of the R code in this guide.

The second type of block, with a white background, is the output of running whatever R code is above it.

2.2 Installation

Install and load the tidyverse:

# Install the tidyverse
install.packages("tidyverse")

# Load the tidyverse
library(tidyverse)

Install and load the nycflights13 package, which contains data on flights that departed from New York City in 2013. We will use the nycflights13::flights data frame to explore the tibble and dplyr packages.

install.packages("nycflights13")
library(nycflights13)

 

3 tibble

The tbl_df class, or “tibble”, is the most important part of the tibble package. Tibbles are the tidyverse version of the base R data.frame class, and tidyverse functions are designed to work with tibbles.

A tibble is stricter than a data.frame. For example, subsetting a tibble with [ always returns another tibble. In comparison, subsetting a data.frame with [ sometimes returns a data.frame and sometimes returns a vector, depending on how many columns are selected.
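As a minimal sketch of this difference, the example below builds a small data frame, converts it to a tibble with as_tibble(), and selects a single column from each with [. The names df, tb, x, and y are made up for illustration and are not part of any package.

```r
library(tibble)

# A small illustrative data frame and its tibble equivalent
# (df, tb, x and y are hypothetical names chosen for this example).
df <- data.frame(x = 1:3, y = c("a", "b", "c"))
tb <- as_tibble(df)

# Selecting one column with [ simplifies a data.frame to a plain vector...
df_col <- df[, "x"]
class(df_col)

# ...whereas the same operation on a tibble returns a one-column tibble.
tb_col <- tb[, "x"]
class(tb_col)

# To get a vector out of a tibble, ask for it explicitly with [[ or $.
tb_vec <- tb[["x"]]
class(tb_vec)
```

Here df_col and tb_vec are integer vectors, while tb_col is still a tibble. This predictability is the main reason code written for tibbles needs fewer special cases.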

This guide uses the terms “tibble” and “data frame” interchangeably.

Learn more about tibbles by running vignette("tibble") in your console, or by reading the CRAN documentation on tibbles.

3.1 Printing a tibble

Print the flights data frame (written in full as nycflights13::flights):

flights

# A tibble: 336,776 x 19
    year month   day dep_time sched_dep_time dep_delay arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>
 1  2013     1     1      517            515         2      830
 2  2013     1     1      533            529         4      850
 3  2013     1     1      542            540         2      923
 4  2013     1     1      544            545        -1     1004
 5  2013     1     1      554            600        -6      812
 6  2013     1     1      554            558        -4      740
 7  2013     1     1      555            600        -5      913
 8  2013     1     1      557            600        -3      709
 9  2013     1     1      557            600        -3      838
10  2013     1     1      558            600        -2      753
# … with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
#   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
#   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#   minute <dbl>, time_hour <dttm>

When a tibble is printed, it displays the first 10 rows and as many columns as will fit in the console. The header shows the total number of rows and columns, and the type of each column (also called a “variable”) appears beneath its name. The footer lists how many rows were omitted, along with the names and types of any columns that did not fit. Tibbles also use different font styles and colors to make certain values more visually distinct. This formatting is not reproduced in the output code block above.
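These printing defaults can be overridden. The sketch below assumes the tibble and nycflights13 packages are installed as in section 2.2. It uses the n and width arguments of print() to control how many rows and columns are shown, and glimpse() for a transposed view. The value 15 is an arbitrary choice for illustration.

```r
library(tibble)
library(nycflights13)

# Total number of rows and columns, matching the header of the printed tibble.
flights_dim <- dim(flights)
flights_dim

# Print the first 15 rows instead of the default 10.
print(flights, n = 15)

# Print every column; columns that do not fit are wrapped into further blocks.
print(flights, width = Inf)

# Transposed view: one line per column, showing its type and first few values.
glimpse(flights)
```

glimpse() is often the more convenient option for wide tables such as flights, since every column name and type is visible without widening the console.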

If you print flights in an R editor like RStudio, the tibble may look like Figure 1 below, where negative values (as in the dep_delay column) are printed in red, and metadata is printed in gray.