Advanced Quantitative Methods for Linguistic Data

Author

Morgan Sonderegger, Márton Sóskuthy, Massimo Lipari, Amanda Doucette

Published

July 16, 2025

Preface

This e-book is a study guide on quantitative methods for linguistic data, beyond the (2025) standard of frequentist linear and logistic (mixed-effects) regression modeling. Each chapter corresponds roughly to one week of a semester-long course (see “Context”) and assumes that you have done the readings indicated at its beginning.

Contributors

The authors are:

Chapters 1-9: Sonderegger
Chapters 10-11: Sóskuthy
Chapter 12: Sóskuthy & Sonderegger
Chapter 13: Lipari
Chapter 14: Doucette

We use datasets from (incomplete list):

Context

The last update to these notes was on the “Published” date above.

Each chapter includes:

1. Applications to linguistic data of concepts from the reading
2. Practical illustration of topics from the reading
3. Exercises

The motivation for these notes is a lack of (1) and/or (2) and/or (3) in existing resources. To help linguists develop their quantitative toolbox, this study guide gives:

Practical application to go with excellent existing readings on Bayesian regression models, primarily from McElreath (2020) (Chapters 1-8)
An applied introduction to methods which don’t currently have up-to-date published tutorials (Chapters 9, 12, 13, 14)
Existing tutorial materials in a modern (Quarto) format (Chapters 10-11).

These materials have been used in:

A graduate course (LING 683, Advanced Quantitative Methods) taught in McGill Linguistics in Fall 2024. (Course schedule, including reading list)
The course “Bayesian Regression Modeling for Language Data: A Crash Course” at the 2025 LSA Institute (parts of Chapters 1-9)
Tutorials by Márton Sóskuthy (Chapters 10-12)

These notes were originally compiled with LING 683 in mind, but they are intended for use as a study guide for language scientists interested in expanding their quantitative toolbox. They can be seen as a follow-up to the material in Sonderegger (2023), but should be usable by readers who have learned similar material from a different source.

Here is the introduction to the course syllabus, which should give a sense of whether these materials could be helpful for you:

“This is a second course on quantitative methods for analyzing linguistic data. It follows LING 620, where we focused on regression modeling using R, up to linear and logistic mixed-effects models. Using this as a starting point, our goals are to broaden your conceptual knowledge and methodological toolkit of quantitative methods, in order to broaden the research questions you can ask and the types of data you can analyze. This term we will cover (a) Bayesian data analysis and (b) generalized additive (mixed) models, along the way introducing (c) model types beyond linear and logistic (e.g. multinomial, Poisson) and (d) possibly other current methods (e.g. functional data analysis). These methods are increasingly used to analyze linguistic data, but are relatively new to language scientists, and standard tools and best practices for practical applications are evolving. A theme of the course is practical application, and a primary goal is developing a sufficiently strong basis in (a)–(c) that you will be able to figure out the quantitative methods needed to analyze your data in the future.”

License

CC-BY-SA-4.0

Citation

Sonderegger, Morgan, Sóskuthy, Márton, Lipari, Massimo, and Doucette, Amanda. (2025). Advanced Quantitative Methods for Linguistic Data. Version 0.2. https://doi.org/10.5281/zenodo.15942068.