Chapter 10 Appendix: Datasets and packages
All datasets used in this book are publicly available, either as:
- Datasets in the languageR R package (R. H. Baayen, 2013) associated with R. H. Baayen’s book Analyzing Linguistic Data: the english and regularity datasets
- Open Science Framework (OSF) projects: all other datasets
10.1 english lexical decision and naming latencies
Description from the languageR documentation:
“This data set gives mean visual lexical decision latencies and word naming latencies to 2284 monomorphemic English nouns and verbs, averaged for old and young subjects, with various predictor variables.”
The original source of this data is the English Lexicon Project.
To load this data and learn what different columns mean, execute in R:
library(languageR)
?english
In this book we often use RTlexdec, Word, AgeSubject, WrittenFrequency, LengthInLetters.
10.2 Dutch regularity
This dataset, originally from the study reported by Baayen & Prado Martin (2005), is described in the languageR documentation:
Regular and irregular Dutch verbs and selected lexical and distributional properties.
To load this data and learn what different columns mean, execute in R:
library(languageR)
?regularity
In this book we often use WrittenFrequency, Verb, Auxiliary, Regularity.
10.3 European French phrase-medial vowel devoicing
This dataset is from Torreira & Ernestus (2010), a study examining phrase-medial vowel devoicing in European French. The data is posted as french_medial_vowel_devoicing.txt in the OSF project (Torreira & Ernestus, 2018).
10.3.1 Background
This is an example from the paper:

The french_medial_vowel_devoicing data (or just devoicing) consists of:
550 French syllables with voiceless obstruent onset (/pktsf/) followed by a high vowel (/iuy/),
located in non-final position within the intonational phrase (IP)
extracted from a corpus of spontaneous conversational speech (Torreira, Adda-Decker, & Ernestus, 2010).
One way to measure the extent to which “devoicing” has occurred is syllable duration.
The research question considered for this dataset (in this book) is: are function words (e.g. “qui”, “tu”, “si”) shorter than other syllables?
Variables:
- Response variable: syllable duration (syldur)
- Predictors:
  - word type (function/content) (func)
  - speech rate (speechrate)
  - onset type (c1)
  - vowel type (v)
10.4 North American English tapping
This data is from a speech production experiment by Kilbourn-Ceron (2017) and Kilbourn-Ceron, Wagner, & Clayards (2017), examining tapping in North American English. The dataset, which contains only a subset of the data analyzed in the publications, is posted as tappedMcGillLing620.csv in an OSF project (Wagner, Kilbourn-Ceron, & Clayards, 2018).
10.4.1 Background
In North American English, the sounds [t] and [d] can sometimes be optionally pronounced as [ɾ] (a “tap”) if followed by a vowel:
“For those of you who’d like to eat early, lunch will be served.”
“For those of you who’d like to eaɾ early, lunch will be served.”
According to earlier work, tapping interacts with syntax (e.g. Scott & Cutler, 1984):
Sounds OK: “For those of you who’d like to eat, early lunch will be served.”
Doesn’t: “For those of you who’d like to eaɾ, early lunch will be served.”
It seems that a syntactic juncture following [t/d] makes tapping less likely. But is this true? The effect could result from either:
- a syntactic juncture, or
- a stronger prosodic boundary, which correlates with a syntactic juncture.
Kilbourn-Ceron (2017) and Kilbourn-Ceron et al. (2017) report a production experiment to investigate this question.
Participants produced sentences like:
“If you plit, Alice will be mad”
“If you plit Alice, John will be mad”
for nonce words like “plit” ending in [t] or [d], followed by a vowel-initial word (such as “Alice”).
The two manipulations are:
- syntax: the nonce word can be
  - intransitive (next word = following clause)
  - transitive (next word = complement)
- speakingRate:
  - normal vs. fast
The research questions are:
- How often did participants tap, depending on speakingRate and syntax?
  - This is a categorical response variable, tapped (0/1).
- Does tapping rate depend on the prosodic juncture between words?
  - The juncture strength is estimated using the duration of the preceding vowel: vowelDuration.
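As a rough illustration of the first question, tapping rates per condition are simply cell means of the 0/1 variable tapped. The sketch below uses a small invented data frame: the column names follow the description above, but the level labels and every value are assumptions made for illustration, and they say nothing about the real results.

```r
# Toy data: level labels and values are invented for illustration only.
toy <- data.frame(
  syntax       = rep(c("intransitive", "transitive"), each = 2, times = 2),
  speakingRate = rep(c("normal", "fast"), each = 4),
  tapped       = c(0, 0, 1, 0, 0, 1, 1, 1)
)

# Proportion of tapped tokens in each syntax x speakingRate cell:
# the mean of a 0/1 variable is the proportion of 1s.
rates <- aggregate(tapped ~ syntax + speakingRate, data = toy, FUN = mean)
print(rates)
```

With the real data, the same call could be applied after reading tappedMcGillLing620.csv, assuming its columns are named as described above.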
10.5 halfrhyme: English half-rhymes
This data is from a speech perception experiment by Harder (2013), examining what determines how “good” English speakers think imperfect rhymes are (e.g. time/tide). The dataset is posted as halfrhymeMcGillLing620.csv in an OSF project (Wagner, 2018b).
10.6 givenness data: the Williams Effect
This data is from a speech production experiment, reported in Wagner (2012), examining how information structure affects which words in a sentence are pronounced with more emphasis (“prominence”). The dataset is posted as givennessMcGillLing620.csv in an OSF project (Wagner, 2018a).
10.6.1 Background
When we speak, some words have more emphasis than others: this is an aspect of prosody that linguists call prominence. (For example, if you read the preceding sentence out loud, “prominence” is probably emphasized.) The opposite of prominence is reduction, where a word is produced without any emphasis.
Constituents (e.g. words) that are “given” or salient in discourse (usually because they have just been used before) and constituents whose referents are highly salient (for example the referent of a pronoun may be given, even if the pronoun itself has not been used before) are often prosodically reduced. For example, the referring expression “John” in the second sentence below is unlikely to carry an accent, and instead prominence (shown with bold) is likely to be shifted to “greeted” (shifted compared to where prominence would have fallen had the first sentence not been there):
- “John finally arrived at the function. Mary greeted John.”
A systematic exception is cases in which a constituent is contrastive in addition to being given. Consider the following case, where the second sentence marks a double contrast: the agent “Mary” is contrasted with the agent of the previous sentence, “John”; in addition, the patient “John” is contrasted with the patient of the previous clause, “Mary”. The fact that both “John” and “Mary” are given does not seem to be relevant if they are also contrastive:
- “John kissed Mary. Then Mary kissed John.”
We can conclude that contrast (marked by accenting) trumps givenness (marked by lack of accentuation). This dataset was used by Wagner (2012) to investigate a systematic class of apparent counterexamples, which were first observed by Williams (1980). According to Williams, in these cases accenting a contrastive referring expression sounds odd, contrary to what we would expect from a semantic point of view:
- # “John kissed Mary. Then John was kissed by Mary.”
That is, even though “Mary” is contrastive in the second sentence, and thus should be accented, doing so sounds odd (indicated by “#”). Wagner’s experiment, and the analysis used in this book, tests whether this Williams effect is real, and how it affects the way speakers produce sentences like the one above (with “#”).
The dataset givenness contains data from the experiment, where participants produced sentences of four types:
1. John greeted Mary, and then John was greeted by Mary.
2. Mary was greeted by John, and then John was greeted by Mary.
3. John greeted her, and then John was greeted by her.
4. She was greeted by John, and then John was greeted by her.
These four sentences constitute one item in the dataset. There are many other items, each with a similar structure: two clauses, with a meaning equivalent to “noun\(_1\) verbed noun\(_2\), and then noun\(_2\) verbed noun\(_1\)”, for different choices of the verb, noun\(_1\), and noun\(_2\).
The sentences convey the same information in four ways. In each one, the second clause is passive, and the NP referring to Mary appears at the end of the clause. The four sentences differ in whether the NP referring to Mary is a full NP (her name) or a pronoun, and whether the NP referring to Mary appears at the end of the first clause, or not:
1. Full NP, at the end of clause 1
2. Full NP, not at the end of clause 1
3. Pronoun, at the end of clause 1
4. Pronoun, not at the end of clause 1
This experiment thus has a 2x2 design, with two within-item variables. Within an item, the sentences corresponding to #1-#4 were seen by four different participants, and each participant saw exactly one sentence from each item. (Thus, this experiment has a Latin square design.)
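To make the Latin square idea concrete, here is a minimal sketch of one way such an assignment can be generated. The numbers of items and participants, and the rotation rule, are assumptions chosen for illustration; they are not taken from the original experiment.

```r
# Hypothetical sizes, chosen only for illustration.
n_items        <- 8
n_participants <- 4
types          <- paste0("type", 1:4)   # the four sentence types of an item

# Every participant sees every item once ...
design <- expand.grid(item = 1:n_items, participant = 1:n_participants)

# ... and the sentence type is rotated so that, within an item,
# each of the four types goes to a different participant.
design$type <- types[(design$item + design$participant) %% 4 + 1]

# Check 1: each item x type cell is filled by exactly one participant.
table(design$item, design$type)
# Check 2: each participant sees the four types equally often.
table(design$participant, design$type)
```

The rotation means that no participant ever sees two versions of the same item, while every version of every item is still produced by someone.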
The experiment examined what factors affect whether the final NP (henceforth the target NP) in sentences like #1-#4 is accented (as in (a) below), or whether the accent is shifted to an earlier word (as in (b)-(d)).
(a) John greeted Mary, and then John was greeted by Mary.
(b) John greeted Mary, and then John was greeted by Mary.
(c) John greeted Mary, and then John was greeted by Mary.
(d) John greeted Mary, and then John was greeted by Mary.
We consider four predictors which could affect whether stress is shifted:
- conditionLabel (within-item variable 1): Whether the target NP (the last word of the second clause) previously appears
  - at the beginning of clause 1 (level Contrast)
  - at the end of clause 1 (level Williams), in which case both clauses end with the same word.
- npType (within-item variable 2): Whether the second NP is a full NP or a pronoun
  - corresponds to the two-level factor npType (levels: full, pronoun).
- voice: Whether clause 2 is active or passive.
  - (Note that the values of voice and conditionLabel together determine whether clause 1 is active or passive.)
- order: Stimulus presentation order (within each participant).[56]
Why these predictors? In order:
1. The main motivation of the experiment was the observation that sentences like the one marked with “#” above sound odd, at least when the final word is accented.
   - Cases like conditionLabel=Williams, where two consecutive sentences end with identical accented phonological chunks, are dispreferred.
   - This is called the Williams effect by Wagner.
2. Speakers might be more or less willing to stress pronouns versus full NPs in general, or the Williams effect might differ in strength between the two types of NPs.
3. Whether clause 2 is active or passive affects what word stress would be shifted to, which might also interfere with the Williams effect.
4. Finally, how much participants shift stress might change over the course of the experiment as they get used to the stimuli.
Examples in this book consider two response variables indicating whether prominence has shifted.
- The variable shifted is a binary (0/1) factor indicating whether a research assistant heard prominence as shifted or not (levels shift, noshift).
- The variable acoustics is a linear combination of various acoustic cues that acts as a proxy for shifted.[57]
  - The acoustic cues are related to pitch, duration, and intensity of the target NP.
  - A higher acoustics value correlates with a lowering of the prominence of the target NP relative to earlier words in clause 2, and hence whether stress has been shifted from the target NP (higher acoustics) or not (lower acoustics).
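The note on acoustics explains that it was built as the linear predictor of a logistic regression of shifted on prosodic measures. The sketch below shows what that construction looks like on simulated data: the three cue variables, their coefficients, and the sample size are all invented for illustration and are not the measures or values used for the real dataset.

```r
set.seed(1)
n <- 200

# Simulated (made-up) acoustic cues for the target NP.
pitch     <- rnorm(n)
duration  <- rnorm(n)
intensity <- rnorm(n)

# Simulated binary annotation: lower cue values make a shift more likely.
p_shift <- plogis(-0.5 - 1.0 * pitch - 0.8 * duration - 0.5 * intensity)
toy <- data.frame(shifted = rbinom(n, size = 1, prob = p_shift),
                  pitch, duration, intensity)

# Logistic regression predicting the annotation from the cues.
fit <- glm(shifted ~ pitch + duration + intensity, data = toy, family = binomial)

# The linear predictor (log-odds scale) plays the role of 'acoustics'.
toy$acoustics <- predict(fit, type = "link")

# It is literally a linear combination: intercept + weighted sum of the cues.
manual <- drop(model.matrix(fit) %*% coef(fit))
all.equal(unname(toy$acoustics), unname(manual))

# Higher values go with shifted tokens, lower values with unshifted ones.
tapply(toy$acoustics, toy$shifted, mean)
```

Because the resulting variable is continuous, it can serve as the response in a linear regression, while shifted remains the response for logistic regression.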
10.7 alternatives
This data is from a speech production experiment, reported in Wagner (2016), examining how information structure affects which words in a sentence are pronounced with more emphasis (“prominence”). The dataset, which contains only a subset of the data analyzed in the paper, is posted as alternativesMcGillLing620.csv in an OSF project (Wagner, 2018c).
10.7.1 Background
In the experiment, participants read sentences like “She brought a new bicycle”, where an adjective modifies a noun. The manipulation was whether a contrast to “new” was mentioned in the context. There were three conditions, captured by a three-level variable context:
1. New (no previous mention of bicycle)
   - Ex: “Guess what John’s aunt, who is incredibly generous, brought for his birthday: A new bicycle!”
2. NoAlternative (previous mention of bicycle, but no true alternatives to “new bicycle”)
   - Ex: “Guess what John’s aunt, who produces expensive bicycles, brought for his birthday: A new bicycle!”
3. Alternative (previous mention of bicycle, with a true alternative to “new bicycle”)
The question of interest is: when do speakers shift prominence from “bicycle” to “new”? In condition (1), presumably they don’t, because “bicycle” has not been mentioned before. But what about (2) vs. (3)?
The response is the binary (0/1) variable shifted, which captures whether prominence was shifted to the adjective (as perceived by a research assistant).
10.8 VOT
This dataset contains voice onset times (VOTs) measured for speech from a corpus of speakers of different British English dialects, a subset of the VOT data analyzed by Sonderegger, Bane, & Graff (2017). The data is posted as votMcGillLing620.csv in an OSF project (Sonderegger, Bane, & Graff, 2018).
10.8.1 Background
The dataset contains VOT measurements (in msec) for 4728 word-initial voiceless stops, corresponding to 424 word types and 21 speakers. Only stops beginning with /t/ and /k/ are included (/p/-initial stops have been omitted, to simplify the analysis). Like most corpus data, this dataset is very unbalanced: there are between 1 and 1505 tokens per word type, and 35-619 tokens per speaker.
The dataframe contains columns corresponding to a number of variables, at the level of the speaker, the word, and individual observations:
- Speaker-level
  - maleSpeaker: 1/0 for tokens from male/female speakers.
  - speakingRateMean: Mean speaking rate (syllables/second) for data points from this speaker.
- Word-level
  - poaVelar: 1/0 when place of articulation is velar/alveolar (i.e. /k/ vs. /t/)
  - followingHigh: 1/0 when the following vowel is high/non-high (e.g. tea vs. tap)
  - stressedSyll: 1/0 when the syllable containing the stop is stressed/unstressed (including function words).
- Observation-level
  - speakingRateDev: Speaking rate for this observation, minus speakingRateMean for this speaker.
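The two speaking rate variables split each observation's rate into a between-speaker part and a within-speaker part. A minimal sketch of that decomposition on invented numbers follows; the speaker and speakingRate columns here are hypothetical names for a speaker identifier and the raw per-observation rate, which are not listed among the variables above.

```r
# Toy data: speaker IDs and raw speaking rates are invented for illustration.
toy <- data.frame(
  speaker      = c("s1", "s1", "s1", "s2", "s2"),
  speakingRate = c(4, 5, 6, 3, 5)   # syllables/second
)

# Between-speaker part: each speaker's mean rate, repeated on every row.
# ave() applies mean() within groups by default.
toy$speakingRateMean <- ave(toy$speakingRate, toy$speaker)

# Within-speaker part: how much faster or slower this observation is
# than the speaker's own average.
toy$speakingRateDev <- toy$speakingRate - toy$speakingRateMean

print(toy)
```

By construction the deviations average to zero within each speaker, so the two variables carry separate information about across-speaker and within-speaker variation.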
The expected effects of some of these predictors on VOT can be intuitively understood as consequences of reduction. Faster speaking rate means (by definition) that speech is compressed, including the part of stop consonants corresponding to VOT. This compression might take place across speakers (those who talk faster, on average, have lower VOTs), within speakers (faster speech \(\Rightarrow\) lower VOT, for a single speaker), or both. Unstressed syllables are often realized as more reduced phonetically, corresponding to shorter segments (and hence shorter VOT). The source of the place of articulation effect on VOT is more mysterious (there are several proposed explanations), but has been observed across many languages (Cho & Ladefoged, 1999). The source of the gender effect is also not totally clear, and studies differ on whether men truly have lower VOTs than women, or whether they just speak faster than women, on average (which is true for English) (Morris, McCrea, & Herring, 2008).
The response is logVOT. We log-transform VOT because this brings its distribution closer to normality, and because VOT can only be positive for voiceless stops, hence a model which can predict negative VOT would not make sense.
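The positivity argument can be checked directly: whatever value a model predicts on the log scale, transforming it back gives a positive VOT. The sketch below uses invented VOT values and assumes a natural logarithm; the base actually used for logVOT in the dataset is not stated here.

```r
# Made-up VOT values in msec, for illustration only.
vot_ms <- c(20, 45, 80, 120)

# Assumption: natural log. The same argument holds for any base.
logVOT <- log(vot_ms)

# Any real-valued prediction on the log scale, even a negative one,
# back-transforms to a positive VOT.
pred_log <- c(-1, 0, 2.5)
pred_ms  <- exp(pred_log)
all(pred_ms > 0)

# The transformation loses nothing: exp() recovers the original values.
all.equal(exp(logVOT), vot_ms)
```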
10.9 Transitions
This dataset is from Roberts, Torreira, & Levinson (2015), a study examining the factors which determine the speed of turn-taking in conversation. The data is posted as transitions.txt in an OSF project (Roberts, Torreira, & Levinson, 2018).
The dataset contains around 20,000 conversational transitions between speakers engaged in spontaneous conversation during telephone calls, from the Switchboard Corpus (Godfrey & Holliman, 1997). The dataset contains fifty variables, but only some of them are used in this book. Here is a brief explanation of some relevant variables (columns):
- dur: the duration (ms) of the floor transition between turn A (the turn preceding the floor transition), and turn B (the turn following the transition)
- spkA: the id of the speaker of the turn preceding the floor transition.
- spkB: the id of the speaker of the turn following the floor transition.
- sexA: the sex of the speaker of the turn preceding the floor transition.
- sexB: the sex of the speaker of the turn following the floor transition.
- dialActA: the dialogue act (e.g. yes-no question, wh-question, statement, answer, backchannel) of the turn preceding the floor transition.
- dialActB: the dialogue act of the turn following the floor transition.
- uttNSylA: the number of syllables in the turn preceding the floor transition.
- uttNSylB: the number of syllables in the turn following the floor transition.
- uttDurA: the duration (ms) of the turn preceding the floor transition.
- uttDurB: the duration (ms) of the turn following the floor transition.
Letter A is used in the name of variables referring to the turn preceding the floor transition, whereas letter B is used in the name of variables referring to the turn following a floor transition.
10.10 Packages
In addition to languageR, packages used in this book include:
- arm (Gelman & Su, 2018)
- bookdown (Xie, 2018)
- rms and Hmisc (Harrell Jr, 2018; Harrell Jr, Charles Dupont, & others, 2018)
- influence.ME (Nieuwenhuis, Te Grotenhuis, & Pelzer, 2012)
- “Tidyverse” packages ggplot2, dplyr, tidyr, etc. (Wickham, 2016; Wickham & Henry, 2018; Wickham, François, Henry, & Müller, 2018)
References
Baayen, R. H. (2013). LanguageR: Data sets and functions with “analyzing linguistic data: A practical introduction to statistics”. Retrieved from https://CRAN.R-project.org/package=languageR
Baayen, R. H., & Prado Martin, F. M. del. (2005). Semantic density and past-tense formation in three Germanic languages. Language, 81(3), 666–698.
Torreira, F., & Ernestus, M. (2010). Phrase-medial vowel devoicing in spontaneous French. In 11th Annual Conference of the International Speech Communication Association (Interspeech 2010) (pp. 2006–2009).
Torreira, F., & Ernestus, M. (2018). French devoicing. Open Science Framework. Retrieved from osf.io/jverz
Torreira, F., Adda-Decker, M., & Ernestus, M. (2010). The Nijmegen corpus of casual French. Speech Communication, 52(3), 201–212.
Kilbourn-Ceron, O. (2017). Speech production planning affects variation in external sandhi (PhD thesis). McGill University.
Kilbourn-Ceron, O., Wagner, M., & Clayards, M. (2017). The effect of production planning locality on external sandhi: A study in /t/. In Proceedings of the 52nd Annual Meeting of the Chicago Linguistic Society.
Wagner, M., Kilbourn-Ceron, O., & Clayards, M. (2018). The effect of production planning locality on external sandhi: A study in /t/. OSF. Retrieved from osf.io/8rjxu
Scott, D. R., & Cutler, A. (1984). Segmental phonology and the perception of syntactic structure. Journal of Verbal Learning and Verbal Behavior, 23(4), 450–466.
Harder, L. (2013). Feature mismatch in half-rhymes. Unpublished MA Research Paper McGill University.
Wagner, M. (2018b). Asymmetries in half-rhyme. Open Science Framework. Retrieved from osf.io/k5pnu
Wagner, M. (2012). A givenness illusion. Language and Cognitive Processes, 27(10), 1433–1458. https://doi.org/10.1080/01690965.2011.607713
Wagner, M. (2018a). A givenness illusion. Open Science Framework. Retrieved from osf.io/r4j2w
Williams, E. (1980). Remarks on stress and anaphora. Journal of Linguistic Research, 1(3), 1–16.
Wagner, M. (2016). Information structure and production planning. In C. Féry & S. Ishihara (Eds.), The Oxford handbook of information structure (pp. 541–561). Oxford University Press.
Wagner, M. (2018c). Information structure and production planning. Open Science Framework. Retrieved from osf.io/dha2j
Sonderegger, M., Bane, M., & Graff, P. (2017). The medium-term dynamics of accents on reality television. Language, 93(3), 598–640.
Sonderegger, M., Bane, M., & Graff, P. (2018). Medium-term dynamics of accents on reality TV. OSF. Retrieved from osf.io/dmxuj
Cho, T., & Ladefoged, P. (1999). Variation and universals in VOT: evidence from 18 languages. Journal of Phonetics, 27(2), 207–229. https://doi.org/10.1006/jpho.1999.0094
Morris, R. J., McCrea, C. R., & Herring, K. D. (2008). Voice onset time differences between adult males and females: Isolated syllables. Journal of Phonetics, 36(2), 308–317. https://doi.org/10.1016/j.wocn.2007.06.003
Roberts, S. G., Torreira, F., & Levinson, S. C. (2015). The effects of processing and sequence organization on the timing of turn taking: A corpus study. Frontiers in Psychology, 6, 1–16. https://doi.org/10.3389/fpsyg.2015.00509
Roberts, S. G., Torreira, F., & Levinson, S. C. (2018). The effects of processing and sequence organization on the timing of turn taking: A corpus study. OSF. Retrieved from osf.io/dve6h
Godfrey, J., & Holliman, E. (1997). Switchboard-1 Release 2. Philadelphia: Linguistic Data Consortium.
Gelman, A., & Su, Y.-S. (2018). Arm: Data analysis using regression and multilevel/hierarchical models. Retrieved from https://CRAN.R-project.org/package=arm
Xie, Y. (2018). Bookdown: Authoring books and technical documents with R markdown. Retrieved from https://github.com/rstudio/bookdown
Harrell Jr, F. E. (2018). Rms: Regression modeling strategies. Retrieved from https://CRAN.R-project.org/package=rms
Harrell Jr, F. E., Charles Dupont, & others. (2018). Hmisc: Harrell miscellaneous. Retrieved from https://CRAN.R-project.org/package=Hmisc
Nieuwenhuis, R., Te Grotenhuis, M., & Pelzer, B. (2012). Influence.ME: Tools for detecting influential data in mixed effects models. R Journal, 4(2), 38–47.
Lenth, R. (2016). Least-squares means: The R package lsmeans. Journal of Statistical Software, 69(1), 1–33. https://doi.org/10.18637/jss.v069.i01
Lenth, R. (2018). Emmeans: Estimated marginal means, aka least-squares means. Retrieved from https://CRAN.R-project.org/package=emmeans
Wickham, H. (2016). Ggplot2: Elegant graphics for data analysis. Springer-Verlag New York. Retrieved from http://ggplot2.org
Wickham, H., & Henry, L. (2018). Tidyr: Easily tidy data with ’spread()’ and ’gather()’ functions. Retrieved from https://CRAN.R-project.org/package=tidyr
Wickham, H., François, R., Henry, L., & Müller, K. (2018). Dplyr: A grammar of data manipulation. Retrieved from https://CRAN.R-project.org/package=dplyr
[56] Although order can only take on a discrete set of values, in applications in this book we treat it as a continuous variable (not a factor).
[57] Importantly, acoustics was artificially constructed, and should not be used for any re-analysis or replication of the original experiment. In the original experiment, only shifted was annotated, and various acoustic prosodic measures (pitch, duration and intensity of different words) were automatically extracted. We constructed the acoustics variable for teaching purposes, by running a logistic regression predicting shifted from the prosodic measures, and taking the resulting linear predictor from the regression to be acoustics. This allows us to analyze the same dataset using both linear regression (response = acoustics) and logistic regression (response = shifted).