Goal: See practical examples of using tidyverse packages and user-defined functions to analyze and visualize a psycholinguistic experiment.

Prerequisites: You should be familiar with the tidyverse package dplyr and psycholinguistic experiment analysis.

1 Introduction

This document uses tidyverse packages to analyze and visualize data from a real psycholinguistic experiment, the eye tracking while reading experiment RC. This guide also introduces user-defined functions, and provides examples of defining functions to facilitate data analysis and visualization.

2 RC

RC is an eye tracking while reading experiment conducted at the UCLA Language Processing Lab by Angelica Pan and supervised by Jesse Harris. Data were collected with an SR Research EyeLink 1000 Tower Mount and preprocessed with EyeDry.

2.1 Eye tracking while reading

In an eye tracking while reading experiment, subjects read text that is displayed on a computer screen. As they read, a high speed camera records their eye movements. Eye tracking while reading is a popular method for studying sentence processing because certain eye movement patterns, such as long fixations on difficult sentence structures, have been linked to cognitive effort.

To learn more about eye tracking, read Eye Movements in Reading Words and Sentences (2007) by Charles Clifton Jr., Adrian Staub, and Keith Rayner.

2.2 Experimental Design

RC has a 2x2 Latin square design, crossing relative clause matrix position (pos) with relative clause extraction type (ext). The experiment tests how pos and ext affect relative clause processing difficulty.

Factors and levels:

  • pos: matrix subject (SUB) or matrix object (OBJ)
  • ext: subject-extracted (SRC) or object-extracted (ORC)

Conditions:

  • SUB-SRC: subject-extracted relative clause in matrix subject position
  • SUB-ORC: object-extracted relative clause in matrix subject position
  • OBJ-SRC: subject-extracted relative clause in matrix object position
  • OBJ-ORC: object-extracted relative clause in matrix object position
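
The four conditions are every combination of the two factors. As a small illustration (not part of the original analysis), base R's expand.grid() can enumerate them. The column name `label` is only a name chosen for this sketch.

```r
# Enumerate the 2x2 design: every pairing of a pos level with an ext level.
# expand.grid() varies its first argument fastest, so ext is listed first
# to get the condition order used in this guide.
design <- expand.grid(ext = c("SRC", "ORC"),
                      pos = c("SUB", "OBJ"),
                      stringsAsFactors = FALSE)

# Build condition labels of the form "SUB-SRC" (`label` is a made-up column name)
design$label <- paste(design$pos, design$ext, sep = "-")

design$label
```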

2.3 Experimental items

Experimental items are sentence quartets:

  • SUB-SRC: Unsurprisingly, the diplomat who attacked the senator earlier this afternoon avoided the reporter at the charity event.
  • SUB-ORC: Unsurprisingly, the diplomat who the senator attacked earlier this afternoon avoided the reporter at the charity event.
  • OBJ-SRC: Unsurprisingly, the diplomat avoided the reporter who attacked the senator earlier this afternoon at the charity event.
  • OBJ-ORC: Unsurprisingly, the diplomat avoided the reporter who the senator attacked earlier this afternoon at the charity event.

2.4 Regioning

During the experiment, subjects read experimental items as a single line of unbroken text displayed on a computer monitor.

Items are split into 7 regions for analysis:

  • intro: introductory region
  • msub: matrix subject region
  • RC: relative clause region (also called “region of interest” or “critical region”)
  • spillover: spillover region
  • mverb: matrix verb region
  • mobj: matrix object region
  • final: final region

The linear order of the regions differs across the SUB and OBJ levels.

Region order for SUB items:

  • intro (1) | msub (2) | RC (3) | spillover (4) | mverb (5) | mobj (6) | final (7)

Region order for OBJ items:

  • intro (1) | msub (2) | mverb (3) | mobj (4) | RC (5) | spillover (6) | final (7)

2.5 Preprocessing

EyeDry is a program that preprocesses eye tracking data. It extracts eye movement metrics from raw eye tracking data into <metric>.ixs files, which are tabular files that can be read into an R data frame.

Eye tracking metrics mentioned in this guide:

  • first pass times (fp): the sum of all fixations in a region until the first fixation to the region’s left or right
  • right-bounded times (rb): the sum of all fixations in a region until the first fixation to the region’s right
  • second pass times (sp): the sum of all fixations in a region after the initial first pass fixations
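
To make these three definitions concrete, here is a minimal base R sketch that computes them for one region from a single trial's fixation sequence. It is not how EyeDry works internally. The function name `region_metrics()`, the fixation data, and the simplification that a region counts as "entered" at its first fixation (even if the reader had already skipped past it) are all assumptions made for illustration.

```r
# Hypothetical fixation record for one trial: the region each fixation
# landed in, and its duration in ms (values invented for illustration).
fix_region <- c(1, 2, 3, 2, 3, 4, 3, 5)
fix_dur    <- c(200, 180, 250, 150, 220, 210, 190, 230)

region_metrics <- function(fix_region, fix_dur, target) {
  idx       <- seq_along(fix_region)
  in_target <- fix_region == target
  first_in  <- which(in_target)[1]

  # The region was never fixated: no metric can be computed
  if (is.na(first_in)) {
    return(c(fp = NA_real_, rb = NA_real_, sp = NA_real_))
  }

  # Helper: first index in `i`, or one past the end if `i` is empty
  first_or_end <- function(i) if (length(i) == 0) length(fix_region) + 1 else i[1]

  # First fixation outside the region (left or right) after entering it
  first_exit  <- first_or_end(which(idx > first_in & !in_target))
  # First fixation to the right of the region after entering it
  first_right <- first_or_end(which(idx > first_in & fix_region > target))

  c(fp = sum(fix_dur[in_target & idx < first_exit]),   # first pass
    rb = sum(fix_dur[in_target & idx < first_right]),  # right-bounded
    sp = sum(fix_dur[in_target & idx >= first_exit]))  # second pass
}

metrics_r3 <- region_metrics(fix_region, fix_dur, target = 3)
metrics_r3
```

In this toy trial the reader fixates region 3, regresses to region 2, returns to region 3, moves on to region 4, and later comes back to region 3. The regression ends the first pass but not the right-bounded time, which is why the two metrics differ.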

EyeDry also detects if and where a subject blinked while reading a sentence during an experimental trial. Trials with too many blinks on a specified critical region are excluded because no eye movement information can be recorded when a subject blinks.

The critical region in EyeDry is specified by linear order. However, in RC the critical RC region is linearly the 3rd region for SUB items, but linearly the 5th region for OBJ items.

  • SUB items: Unsurprisingly, | the diplomat | who attacked the senator (region 3) | avoided | the reporter | …
  • OBJ items: Unsurprisingly, | the diplomat | avoided | the reporter | who attacked the senator (region 5) | …

To solve this issue, the RC data was split into two files, one with data from the SUB item trials and one with data from the OBJ item trials. EyeDry was run separately on each file.

In the RC experiment, the data for each eye movement metric is therefore contained in two files:

  • <metric>-rcs.ixs: eye movement metric data from SUB item trials
  • <metric>-rco.ixs: eye movement metric data from OBJ item trials.

During analysis, the corresponding <metric>-rcs.ixs and <metric>-rco.ixs files for each metric must be joined for a complete set of values.

2.6 First pass times

A first pass time is the sum of all fixations made in a given region from the first time the point of fixation enters the region until the first time the point of fixation leaves the region. To learn more about first pass times, read Eye Movements in Reading Words and Sentences.

First pass times are an “early” eye tracking metric, meaning that first pass fixations are associated with the initial stages of sentence processing.

Most examples in this guide use first pass times data.

 

3 Data manipulation

This section uses dplyr functions to manipulate first pass time data from the RC experiment. For a review of dplyr, read the Tidyverse Data Manipulation Quickstart.

3.1 .ixs files

The first pass (fp) time data is contained within fp-rcs.ixs and fp-rco.ixs. They are comma-delimited files in which the first row contains the column names.

The full .ixs files that appear in this guide are currently not available to the public.

First 8 lines of fp-rcs.ixs:

seq,subj,item,cond,region,datum
26,1,1,5,1,318
26,1,1,5,2,311
26,1,1,5,3,488
26,1,1,5,4,350
26,1,1,5,5,171
26,1,1,5,6,356
26,1,1,5,7,696

First 8 lines of fp-rco.ixs:

seq,subj,item,cond,region,datum
133,1,3,7,1,323
133,1,3,7,2,417
133,1,3,7,3,
133,1,3,7,4,216
133,1,3,7,5,527
133,1,3,7,6,218
133,1,3,7,7,

The columns are:

  • seq: sequence number
  • subj: subject number (40 subjects)
  • item: item number (16 experimental items)
  • cond: condition number (4 conditions)
  • region: region number (7 regions per experimental item)
  • datum: length of fixation in ms

Each experimental item is split into 7 rows of data, one for each region.

3.2 Reading in data

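Load the tidyverse, then use readr::read_delim() to read in the .ixs files as tibbles:

# Load the tidyverse packages used in this guide (readr, dplyr, tidyr, stringr, ggplot2)
library(tidyverse)

# Read in first pass times for SUB items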
fp_rcs <- read_delim("./fp-rcs.ixs", delim = ",", col_names = TRUE)

# Read in first pass times for OBJ items
fp_rco <- read_delim("./fp-rco.ixs", delim = ",", col_names = TRUE)

Print the newly created data frames:

# First pass times for SUB items
fp_rcs
# A tibble: 2,058 x 6
     seq  subj  item  cond region datum
   <dbl> <dbl> <dbl> <dbl>  <dbl> <dbl>
 1    26     1     1     5      1   318
 2    26     1     1     5      2   311
 3    26     1     1     5      3   488
 4    26     1     1     5      4   350
 5    26     1     1     5      5   171
 6    26     1     1     5      6   356
 7    26     1     1     5      7   696
 8    27     1     2     6      1   470
 9    27     1     2     6      2   133
10    27     1     2     6      3   405
# … with 2,048 more rows

# First pass times for OBJ items
fp_rco
# A tibble: 2,093 x 6
     seq  subj  item  cond region datum
   <dbl> <dbl> <dbl> <dbl>  <dbl> <dbl>
 1   133     1     3     7      1   323
 2   133     1     3     7      2   417
 3   133     1     3     7      3    NA
 4   133     1     3     7      4   216
 5   133     1     3     7      5   527
 6   133     1     3     7      6   218
 7   133     1     3     7      7    NA
 8    47     1     7     7      1   300
 9    47     1     7     7      2   205
10    47     1     7     7      3   184
# … with 2,083 more rows

3.3 Creating columns

We will separately create 2 columns on the fp_rcs and fp_rco data frames before combining the data frames.

  • pos: identify whether a row is from a SUB or OBJ item
  • ext: identify whether a row is from an SRC or ORC item

3.3.1 pos

The pos column identifies whether a row is from a SUB or OBJ item.

Use mutate():

# All items in `fp-rcs.ixs` are SUB items
fp_rcs <- read_delim("./fp-rcs.ixs", delim = ",", col_names = TRUE) %>%
  mutate(pos = "SUB")

# All items in `fp-rco.ixs` are OBJ items
fp_rco <- read_delim("./fp-rco.ixs", delim = ",", col_names = TRUE) %>%
  mutate(pos = "OBJ")

3.3.2 ext

The ext column identifies whether a row is from an SRC or ORC item.

The value of a row’s ext column can be created from the cond column, which identifies the experimental condition of that row/item:

  • cond == 5: SUB-SRC
  • cond == 6: SUB-ORC
  • cond == 7: OBJ-SRC
  • cond == 8: OBJ-ORC

fp-rcs.ixs only contains SUB items (rows with a cond value of 5 or 6):

  • If cond == 5, then ext == SRC
  • Else, cond == 6 and ext == ORC

fp-rco.ixs only contains OBJ items (rows with a cond value of 7 or 8):

  • If cond == 7, then ext == SRC
  • Else, cond == 8 and ext == ORC

Use mutate() and dplyr::if_else():

# If `cond` is `5`, then the item is an SRC, else an ORC.
fp_rcs <- read_delim("./fp-rcs.ixs", delim = ",", col_names = TRUE) %>%
  mutate(pos = "SUB",
         ext = if_else(cond == 5, "SRC", "ORC"))

# If `cond` is `7`, then the item is an SRC, else an ORC.
fp_rco <- read_delim("./fp-rco.ixs", delim = ",", col_names = TRUE) %>%
  mutate(pos = "OBJ",
         ext = if_else(cond == 7, "SRC", "ORC"))

3.4 Combining data frames

Combine the fp_rcs and fp_rco data frames into a single data frame with dplyr::bind_rows():

fp <- bind_rows(fp_rcs, fp_rco)

fp
# A tibble: 4,151 x 8
     seq  subj  item  cond region datum pos   ext  
   <dbl> <dbl> <dbl> <dbl>  <dbl> <dbl> <chr> <chr>
 1    26     1     1     5      1   318 SUB   SRC  
 2    26     1     1     5      2   311 SUB   SRC  
 3    26     1     1     5      3   488 SUB   SRC  
 4    26     1     1     5      4   350 SUB   SRC  
 5    26     1     1     5      5   171 SUB   SRC  
 6    26     1     1     5      6   356 SUB   SRC  
 7    26     1     1     5      7   696 SUB   SRC  
 8    27     1     2     6      1   470 SUB   ORC  
 9    27     1     2     6      2   133 SUB   ORC  
10    27     1     2     6      3   405 SUB   ORC  
# … with 4,141 more rows
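
The read, label, and combine steps are the same for every metric, so they are a natural candidate for a user-defined function. The following is a sketch only: `read_metric()` and its `dir` argument are hypothetical names, and the two tiny .ixs files are written to a temporary directory with made-up rows so that the example runs without the real data.

```r
library(readr)
library(dplyr)

# Read the SUB and OBJ files for one metric, add `pos` and `ext`,
# and return a single combined tibble.
read_metric <- function(metric, dir = ".") {
  rcs <- read_delim(file.path(dir, paste0(metric, "-rcs.ixs")),
                    delim = ",", col_names = TRUE) %>%
    mutate(pos = "SUB",
           ext = if_else(cond == 5, "SRC", "ORC"))

  rco <- read_delim(file.path(dir, paste0(metric, "-rco.ixs")),
                    delim = ",", col_names = TRUE) %>%
    mutate(pos = "OBJ",
           ext = if_else(cond == 7, "SRC", "ORC"))

  bind_rows(rcs, rco)
}

# Demo files with invented rows (one row per condition)
demo_dir <- tempfile("ixs-demo")
dir.create(demo_dir)
writeLines(c("seq,subj,item,cond,region,datum",
             "26,1,1,5,1,318",
             "27,1,2,6,1,470"),
           file.path(demo_dir, "fp-rcs.ixs"))
writeLines(c("seq,subj,item,cond,region,datum",
             "133,1,3,7,1,323",
             "134,1,4,8,1,"),
           file.path(demo_dir, "fp-rco.ixs"))

fp_demo <- read_metric("fp", dir = demo_dir)
fp_demo
```

With a function like this, reading another metric would be a single call such as `read_metric("rb")`, assuming its files follow the same <metric>-rcs.ixs and <metric>-rco.ixs naming scheme.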

3.5 More manipulations

For manipulations that are independent of the SUB and OBJ items, it is more efficient to perform them on a combined fp data frame than to perform them twice on separate data frames:

  • Factors: coercing character vectors into factors
  • cond2: identify whether a row is from a SUB-SRC, SUB-ORC, OBJ-SRC, or OBJ-ORC item
  • region2: identify whether a row is the intro, msub, RC, spillover, mverb, mobj, or final region of an item

3.5.1 Coercing factors

The newly created pos and ext columns are character class vectors:

# Check the class of the `pos` column
class(fp$pos)
[1] "character"
# Check the class of the `ext` column
class(fp$ext)
[1] "character"

However, in this experiment, they should actually be factors with two levels.

Use mutate() and base::factor() to coerce the pos and ext columns into factors:

fp <- bind_rows(fp_rcs, fp_rco) %>% 
  # Coerce `pos` and `ext` columns into factors with the specified levels
  mutate(pos = factor(pos, levels = c("SUB", "OBJ")),
         ext = factor(ext, levels = c("SRC", "ORC")))
# Check the class of the `pos` column
class(fp$pos)
[1] "factor"
# Check the class of the `ext` column
class(fp$ext)
[1] "factor"
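
The levels argument matters because factor() orders levels alphabetically by default, which would put OBJ before SUB in summaries and plots. A short base R illustration on a toy vector (the variable names are invented for this sketch):

```r
pos_chr <- c("SUB", "OBJ", "SUB", "OBJ")

# Default: levels are sorted alphabetically
default_pos <- factor(pos_chr)
levels(default_pos)

# Explicit: levels follow the order given in `levels`
ordered_pos <- factor(pos_chr, levels = c("SUB", "OBJ"))
levels(ordered_pos)
```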

3.5.2 cond2

The cond2 column identifies whether a row is from a SUB-SRC, SUB-ORC, OBJ-SRC, or OBJ-ORC item.

Concatenate the pos and ext columns with stringr::str_c(), and coerce cond2 into a factor for analysis:

fp <- bind_rows(fp_rcs, fp_rco) %>% 
  mutate(pos = factor(pos, levels = c("SUB", "OBJ")),
         ext = factor(ext, levels = c("SRC", "ORC")),
         # Create `cond2` column
         cond2 = str_c(pos, ext, sep = "-"),
         # Coerce `cond2` into a factor
         cond2 = factor(cond2, levels = c("SUB-SRC", "SUB-ORC", "OBJ-SRC", "OBJ-ORC")))

3.5.3 region2

The region2 column identifies whether a row is the intro, msub, RC, spillover, mverb, mobj, or final region of an item.

The value of a row’s region2 column can be determined by the value of the region and pos columns, following the region orders.

  • If region == 1, then region2 == "intro"
  • Else if region == 2, then region2 == "msub"
  • Else if region == 3 and pos == "SUB", then region2 == "RC"
  • Else if region == 3 and pos == "OBJ", then region2 == "mverb"
  • etc.

dplyr::case_when() is the equivalent of nesting multiple dplyr::if_else() statements.

Use mutate() and case_when():

fp <- bind_rows(fp_rcs, fp_rco) %>% 
  mutate(pos = factor(pos, levels = c("SUB", "OBJ")),
         ext = factor(ext, levels = c("SRC", "ORC")),
         cond2 = str_c(pos, ext, sep = "-"),
         cond2 = factor(cond2, levels = c("SUB-SRC", "SUB-ORC", "OBJ-SRC", "OBJ-ORC")),
         # Create `region2` column
         region2 = case_when(region == 1 ~ "intro",
                             region == 2 ~ "msub",
                             region == 3 & pos == "SUB" ~ "RC",
                             region == 3 & pos == "OBJ" ~ "mverb",
                             region == 4 & pos == "SUB" ~ "spillover",
                             region == 4 & pos == "OBJ" ~ "mobj",
                             region == 5 & pos == "SUB" ~ "mverb",
                             region == 5 & pos == "OBJ" ~ "RC",
                             region == 6 & pos == "SUB" ~ "mobj",
                             region == 6 & pos == "OBJ" ~ "spillover",
                             region == 7 ~ "final"))

fp
# A tibble: 4,151 x 10
     seq  subj  item  cond region datum pos   ext   cond2   region2  
   <dbl> <dbl> <dbl> <dbl>  <dbl> <dbl> <fct> <fct> <fct>   <chr>    
 1    26     1     1     5      1   318 SUB   SRC   SUB-SRC intro    
 2    26     1     1     5      2   311 SUB   SRC   SUB-SRC msub     
 3    26     1     1     5      3   488 SUB   SRC   SUB-SRC RC       
 4    26     1     1     5      4   350 SUB   SRC   SUB-SRC spillover
 5    26     1     1     5      5   171 SUB   SRC   SUB-SRC mverb    
 6    26     1     1     5      6   356 SUB   SRC   SUB-SRC mobj     
 7    26     1     1     5      7   696 SUB   SRC   SUB-SRC final    
 8    27     1     2     6      1   470 SUB   ORC   SUB-ORC intro    
 9    27     1     2     6      2   133 SUB   ORC   SUB-ORC msub     
10    27     1     2     6      3   405 SUB   ORC   SUB-ORC RC       
# … with 4,141 more rows
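
An alternative to a long case_when() is to store the region orders in a small lookup table and join it on. This is a sketch of that design choice rather than the approach used in this guide; `region_key` and `toy` are made-up names, and `toy` stands in for the real fp data.

```r
library(dplyr)

# One row per (pos, region) pair, following the two region orders
region_key <- tibble(
  pos     = rep(c("SUB", "OBJ"), each = 7),
  region  = rep(1:7, times = 2),
  region2 = c("intro", "msub", "RC", "spillover", "mverb", "mobj", "final",
              "intro", "msub", "mverb", "mobj", "RC", "spillover", "final")
)

# Toy rows standing in for `fp`; here `pos` is a character column
toy <- tibble(pos    = c("SUB", "OBJ", "SUB", "OBJ"),
              region = c(3, 3, 5, 5))

# Look up the region name for each row
toy_labeled <- left_join(toy, region_key, by = c("pos", "region"))
toy_labeled
```

The lookup table keeps the region orders in one place as data, which can be easier to check against the design than eleven case_when() conditions. If pos has already been coerced to a factor, the join column types would need to be reconciled first.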

 

4 Data analysis

This section provides examples of using dplyr functions to analyze first pass time data.

The .ixs files contain missing values in the datum column, so summary functions should include an na.rm = TRUE argument.

4.1 Winsorizing

Winsorization is a method of reducing the effect of outliers by replacing extreme values with less extreme values.
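
As a rough illustration of the idea, the base R sketch below clamps values to the 10th and 90th percentiles. The helper name `winsorize()` is made up for this example, and it is meant to show the general technique, not to reproduce any particular package's implementation exactly.

```r
# Replace values below the `trim` quantile or above the `1 - trim` quantile
# with those quantiles. Missing values stay missing.
winsorize <- function(x, trim = 0.1) {
  lims <- quantile(x, probs = c(trim, 1 - trim), na.rm = TRUE)
  pmin(pmax(x, lims[1]), lims[2])
}

# Toy data with one extreme value (100)
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100)
x_win <- winsorize(x, trim = 0.1)
x_win
```

Here the extreme value 100 is pulled in to the 90th percentile and the smallest value is pulled up to the 10th percentile, while the values in between are unchanged.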

Install and load the psych package for the psych::winsor() function:

install.packages("psych")
library(psych)

First pass times should be winsorized by experimental condition and region, because the different conditions and regions differ in length and difficulty.

Use group_by(), mutate(), and winsor() to winsorize by group:

# Winsorize first pass times by condition and region
fp <- fp %>%  
  group_by(cond2, region2) %>% 
  mutate(win = winsor(datum, trim = 0.1, na.rm = TRUE)) %>% 
  ungroup()

fp
# A tibble: 4,151 x 11
     seq  subj  item  cond region datum pos   ext   cond2   region2     win
   <dbl> <dbl> <dbl> <dbl>  <dbl> <dbl> <fct> <fct> <fct>   <chr>     <dbl>
 1    26     1     1     5      1   318 SUB   SRC   SUB-SRC intro      318 
 2    26     1     1     5      2   311 SUB   SRC   SUB-SRC msub       311 
 3    26     1     1     5      3   488 SUB   SRC   SUB-SRC RC         488 
 4    26     1     1     5      4   350 SUB   SRC   SUB-SRC spillover  350 
 5    26     1     1     5      5   171 SUB   SRC   SUB-SRC mverb      180.
 6    26     1     1     5      6   356 SUB   SRC   SUB-SRC mobj       356 
 7    26     1     1     5      7   696 SUB   SRC   SUB-SRC final      696 
 8    27     1     2     6      1   470 SUB   ORC   SUB-ORC intro      470 
 9    27     1     2     6      2   133 SUB   ORC   SUB-ORC msub       189 
10    27     1     2     6      3   405 SUB   ORC   SUB-ORC RC         405 
# … with 4,141 more rows

4.2 Means and standard errors

The RC experiment looks at the effects of pos and ext on relative clause processing difficulty. Increased processing difficulty is associated with longer first pass time durations.

We should look at the average first pass time (with standard errors) for each condition and region group.

Install and load the plotrix package for the plotrix::std.error() function:

install.packages("plotrix")
library(plotrix)

Use group_by(), summarize(), and std.error() to calculate mean first pass times with standard errors:

# Calculate mean first pass times by condition and region
fp %>% 
  group_by(region2, cond2) %>% 
  summarize(avg = mean(win, na.rm = TRUE), ste = std.error(win, na.rm = TRUE))
# A tibble: 28 x 4
# Groups:   region2 [7]
   region2 cond2     avg   ste
   <chr>   <fct>   <dbl> <dbl>
 1 final   SUB-SRC  585. 26.7 
 2 final   SUB-ORC  580. 23.2 
 3 final   OBJ-SRC  545. 22.1 
 4 final   OBJ-ORC  545. 22.2 
 5 intro   SUB-SRC  523. 19.7 
 6 intro   SUB-ORC  567. 20.5 
 7 intro   OBJ-SRC  522. 21.0 
 8 intro   OBJ-ORC  526. 19.7 
 9 mobj    SUB-SRC  308.  8.86
10 mobj    SUB-ORC  312. 10.0 
# … with 18 more rows
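
The standard error of the mean is the standard deviation divided by the square root of the number of non-missing observations. If you prefer not to depend on an extra package, a user-defined function can compute it. The sketch below uses a hypothetical helper `std_err()` and invented toy values rather than the RC data.

```r
library(dplyr)

# Standard error of the mean, ignoring missing values
std_err <- function(x) {
  x <- x[!is.na(x)]
  sd(x) / sqrt(length(x))
}

# Toy data: two conditions, one missing value
toy <- tibble(
  cond2 = rep(c("SUB-SRC", "SUB-ORC"), each = 4),
  win   = c(300, 320, 340, NA, 400, 420, 440, 460)
)

toy_summary <- toy %>%
  group_by(cond2) %>%
  summarize(avg = mean(win, na.rm = TRUE),
            ste = std_err(win))
toy_summary
```

Because cond2 is a character column in this toy data, the groups come out in alphabetical order (SUB-ORC before SUB-SRC).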

 

5 Data visualization

ggplot2 is a tidyverse package for creating graphics. This section uses ggplot2 to provide examples of tidyverse data visualization, but is not intended to be an introduction to ggplot2. To learn more about data visualization, read Chapter 3 “Data visualization” of R for Data Science, by Garrett Grolemund and Hadley Wickham.

5.1 RC region

The RC experiment has a 2x2 factorial design. Line graphs are good for visualizing 2x2 factorials.

To learn more about factorial design visualization, see Chapter 10.2 “Interpreting main effects and interactions” of Answering Questions with Data by Matthew Crump.

Create a line graph for the most important region, the critical RC region:

# Pick out first pass data from the RC region only, and drop rows with missing values
fp_RC <- fp %>% 
  filter(region2 == "RC") %>% 
  drop_na() 

# Create a line graph for `fp_RC` / the RC region
ggplot(fp_RC, aes(x = pos, y = win, group = ext, color = ext)) +
  labs(title = "First Pass Times", x = "Matrix position (pos)", 
       y = "Winsorized fixation time (ms)", color = "Extraction type (ext)", 
       subtitle = "RC region") +
  # `fun` replaces the `fun.y` argument, which is deprecated as of ggplot2 3.3.0
  stat_summary(fun = mean, geom = "line") +
  stat_summary(fun = mean, geom = "point") +
  stat_summary(fun.data = mean_se, geom = "errorbar", width = 0.25) +  
  theme(legend.position = "bottom", legend.box = "horizontal",
        legend.background = element_rect(fill = "grey90"),
        strip.background = element_rect(fill="grey90"))
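
To draw the same graph for other regions without copying the code, the plotting steps can be wrapped in a user-defined function. This is a minimal sketch: `plot_region()` is a hypothetical name, the toy data is invented so the example runs on its own, and the `fun` argument assumes ggplot2 3.3.0 or later.

```r
library(dplyr)
library(ggplot2)

# Line graph of mean winsorized times (with standard errors) for one region
plot_region <- function(data, region_name) {
  data %>%
    filter(region2 == region_name, !is.na(win)) %>%
    ggplot(aes(x = pos, y = win, group = ext, color = ext)) +
    labs(title = "First Pass Times", subtitle = paste(region_name, "region"),
         x = "Matrix position (pos)", y = "Winsorized fixation time (ms)",
         color = "Extraction type (ext)") +
    stat_summary(fun = mean, geom = "line") +
    stat_summary(fun = mean, geom = "point") +
    stat_summary(fun.data = mean_se, geom = "errorbar", width = 0.25) +
    theme(legend.position = "bottom")
}

# Toy data with the same column names as `fp` (values invented)
toy <- tibble(
  region2 = rep(c("RC", "spillover"), each = 8),
  pos     = factor(rep(c("SUB", "SUB", "OBJ", "OBJ"), times = 4),
                   levels = c("SUB", "OBJ")),
  ext     = factor(rep(c("SRC", "ORC"), times = 8),
                   levels = c("SRC", "ORC")),
  win     = c(480, 560, 450, 520, 500, 590, 470, 540,
              330, 360, 310, 350, 340, 370, 320, 345)
)

p <- plot_region(toy, "RC")
p
```

With the real data, a call such as `plot_region(fp, "spillover")` would produce the corresponding graph for the spillover region.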