## Graduate School of Language Technology in Finland

Kieliteknologian valtakunnallinen tutkijakoulu - Språkteknologiska forskarskolan i Finland

Harald Baayen (Interfaculty research unit for language and speech, University of Nijmegen & Max Planck Institute for Psycholinguistics, Nijmegen)

Department of General Linguistics

Siltavuorenpenger 20 A

Helsinki

13 - 17 December 2004

R is an open-source implementation of S, the language and environment for data analysis originally developed at Bell Laboratories. R is the focus of this course for three reasons: it is an elegant object-oriented system with excellent graphical facilities; it has a consistent, uniform syntax for specifying statistical models, no matter which type of model is being fitted; and it is a programming language in which new ideas, or applications for specific data sets, are easy to implement.

The morning sessions of this course will consist of plenary presentations in which I will introduce R and a selection of techniques that are especially useful for the analysis of quantitative linguistic data. The afternoons will be practical sessions in which participants gain hands-on experience with these techniques, applied to full-scale, real linguistic data sets. The emphasis is, on the one hand, on learning to use graphical tools to explore the quantitative structure of linguistic data sets and, on the other hand, on learning which statistical tools might be useful for a given data set and how to apply them in R.

This 5-day course is structured as follows:

Day 1:

- Morning session, part 1: Introduction to the R programming language (which is similar to MATLAB's) and to a series of basic statistical functions.
- Morning session, part 2: Introduction to data visualization and trellis graphics.
- Afternoon sessions: Hands-on sessions with simple problems to become familiar with R.
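
As a rough illustration of basic statistical functions and trellis graphics, the following minimal sketch uses the `lattice` package, which ships with standard R distributions. The data are synthetic and the variable names (`class`, `freq`, `rt`) are invented for the example; none of this is taken from the course materials.

```r
library(lattice)  # trellis graphics; ships with standard R distributions

# Synthetic data, invented purely for illustration:
# hypothetical log reaction times for two made-up word classes.
set.seed(1)
d <- data.frame(
  class = factor(rep(c("simple", "complex"), each = 50)),
  freq  = rnorm(100)                       # hypothetical (standardized) frequency
)
d$rt <- 6.5 - 0.1 * d$freq + rnorm(100, sd = 0.1)

# A few basic statistical functions
mean(d$rt)                                 # overall mean
tapply(d$rt, d$class, mean)                # mean per word class
cor(d$freq, d$rt)                          # correlation of predictor and response

# Trellis display: one scatterplot panel per word class
p <- xyplot(rt ~ freq | class, data = d)
print(p)
```

The `y ~ x | g` formula asks for a separate panel for each level of `g`, which is the core idea behind trellis displays.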

Day 2:

- Morning session, part 1: Principal components analysis and correspondence analysis, biplots.
- Morning session, part 2: Cluster analysis, multidimensional scaling.
- Afternoon sessions: Hands-on practice with the data discussed in Baayen (1994), Baayen et al. (1996): principal components analysis in stylometry and authorship attribution; Baayen and Moscoso del Prado Martín: multidimensional scaling for regular and irregular verbs in lexical co-occurrence space; Moscoso del Prado Martín and Baayen (in press), Baayen et al. (2004b): hierarchical cluster analyses on lexical variables.
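
To give a flavour of these techniques, here is a minimal sketch using only functions from base R (`prcomp`, `biplot`, `dist`, `hclust`, `cmdscale`). The matrix is random and merely stands in for a texts-by-variables table; the row and column names are hypothetical, and the data sets cited above are not reproduced here. Correspondence analysis is left out of the sketch.

```r
# Synthetic matrix, invented for illustration:
# 10 hypothetical texts described by 5 hypothetical word-frequency variables.
set.seed(1)
x <- matrix(rnorm(50), nrow = 10,
            dimnames = list(paste0("text", 1:10), paste0("word", 1:5)))

# Principal components analysis on standardized variables
pca <- prcomp(x, scale. = TRUE)
summary(pca)                   # proportion of variance per component
biplot(pca)                    # texts and variables in a single display

# Hierarchical cluster analysis on Euclidean distances between texts
d  <- dist(scale(x))
hc <- hclust(d)                # complete linkage is the default
plot(hc)                       # dendrogram

# Classical (metric) multidimensional scaling into two dimensions
mds <- cmdscale(d, k = 2)
plot(mds, type = "n", xlab = "Dimension 1", ylab = "Dimension 2")
text(mds, labels = rownames(x))
```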

Day 3:

- Morning session, part 1: Classification and regression trees, the `tree` and `rpart` libraries.
- Morning session, part 2: Discriminant analysis, support vector machines.
- Afternoon sessions: Hands-on practice with the data discussed in Ernestus and Baayen (2003): classification trees for predicting final devoicing in Dutch; Baayen et al. (1996): discriminant analysis in authorship attribution; Cueni et al. (2004): classification trees and support vector machines for predicting the dative alternation.
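
The sketch below shows what a classification tree and a linear discriminant analysis might look like in R, assuming the `rpart` and `MASS` packages (both ship with standard R distributions). The data are synthetic and the names `x1`, `x2` and `outcome` are invented. Support vector machines are omitted because they require an add-on package (for example `e1071`).

```r
library(rpart)  # recursive partitioning (classification and regression trees)
library(MASS)   # provides lda() for linear discriminant analysis

# Synthetic data, invented for illustration:
# a binary outcome that depends mostly on the first of two numeric predictors.
set.seed(1)
n <- 200
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$outcome <- factor(ifelse(d$x1 + rnorm(n, sd = 0.5) > 0, "A", "B"))

# Classification tree
fit_tree  <- rpart(outcome ~ x1 + x2, data = d, method = "class")
pred_tree <- predict(fit_tree, d, type = "class")

# Linear discriminant analysis, specified with the same formula
fit_lda  <- lda(outcome ~ x1 + x2, data = d)
pred_lda <- predict(fit_lda, d)$class

# Accuracy on the training data (optimistic; a fair estimate
# would require held-out data or cross-validation)
mean(pred_tree == d$outcome)
mean(pred_lda == d$outcome)
```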

Day 4:

- Morning session, part 1: Linear regression (including basic analysis of variance), relaxing the linearity assumption (quadratic terms, restricted cubic splines).
- Morning session, part 2: Regression diagnostics (strategies, graphical possibilities), detecting and dealing with collinearity.
- Afternoon sessions: Hands-on practice with the data discussed in Baayen et al. (2004b): collinearity in lexical predictors for reaction time data; also data from current work with Mirjam Ernestus and Mark Pluymaekers on frequency effects in the acoustic signal of morphologically complex words.
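
A minimal sketch of these regression topics, using only base R and the bundled `splines` package, might look as follows. The data are synthetic and the variable names (`freq`, `famsize`, `rt`) are invented. `ns()` fits a natural cubic spline, which is linear beyond the boundary knots and is used here as a stand-in for a restricted cubic spline. The condition number from `kappa()` is just one possible collinearity check, not necessarily the one used in the course.

```r
library(splines)  # ns(): natural cubic splines; bundled with R

# Synthetic data, invented for illustration
set.seed(1)
n <- 200
d <- data.frame(freq = rnorm(n))
d$famsize <- 0.9 * d$freq + rnorm(n, sd = 0.3)   # deliberately collinear with freq
d$rt <- 6.5 - 0.1 * d$freq + 0.05 * d$freq^2 + rnorm(n, sd = 0.1)

# Linear fit, then two ways of relaxing the linearity assumption
m_lin  <- lm(rt ~ freq, data = d)
m_quad <- lm(rt ~ freq + I(freq^2), data = d)    # quadratic term
m_spl  <- lm(rt ~ ns(freq, df = 3), data = d)    # natural cubic spline
anova(m_lin, m_quad)                             # does the quadratic term help?

# Graphical regression diagnostics
plot(m_quad, which = 1)                          # residuals against fitted values
plot(m_quad, which = 2)                          # normal Q-Q plot of residuals

# Two simple collinearity checks on the predictors
r    <- cor(d$freq, d$famsize)                   # pairwise correlation
cond <- kappa(scale(d[, c("freq", "famsize")]), exact = TRUE)  # condition number
r
cond
```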

Day 5:

- Morning session, part 1: Mixed effect models (the `nlme` library of Pinheiro and Bates (2000)).
- Morning session, part 2: Logistic regression and proportional odds models (the `Design` and `Hmisc` libraries of Harrell (2001)).
- Afternoon sessions: Hands-on practice with the data discussed in Baayen and Moscoso del Prado Martín (2004), Tabak et al. (2004): logistic regression for estimating the likelihood of a verb being regular; Cueni et al. (2004): logistic regression for predicting the dative alternation; Baayen et al. (2004a): mixed effect modeling for reaction time data; Keune and Baayen (2004): mixed effect modeling for register variation.
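
The following sketch illustrates the three model types on synthetic data. It uses `lme()` from `nlme` for the mixed-effects model, but substitutes base R's `glm()` and `polr()` from `MASS` for the `Design` functions named above, so that it runs with a standard R installation. All variable names (`subject`, `item`, `freq`, `rt`, `slow`, `speed`) are invented for the example.

```r
library(nlme)  # lme(): linear mixed-effects models
library(MASS)  # polr(): proportional odds logistic regression

# Synthetic data, invented for illustration:
# 20 hypothetical subjects responding to 10 hypothetical items.
set.seed(1)
n_subj <- 20
n_item <- 10
d <- expand.grid(subject = factor(1:n_subj), item = 1:n_item)
d$freq <- rnorm(nrow(d))
subj_int <- rnorm(n_subj, sd = 0.2)            # by-subject random intercepts
d$rt <- 6.5 + subj_int[as.integer(d$subject)] -
  0.1 * d$freq + rnorm(nrow(d), sd = 0.1)

# Mixed-effects model with a random intercept for subject
m_lme <- lme(rt ~ freq, random = ~ 1 | subject, data = d)
fixef(m_lme)

# Logistic regression on a binary version of the response
d$slow  <- as.integer(d$rt > median(d$rt))
m_logit <- glm(slow ~ freq, data = d, family = binomial)

# Proportional odds model on a three-level ordered version of the response
d$speed <- cut(d$rt, quantile(d$rt, c(0, 1/3, 2/3, 1)),
               include.lowest = TRUE,
               labels = c("fast", "medium", "slow"),
               ordered_result = TRUE)
m_polr <- polr(speed ~ freq, data = d, Hess = TRUE)
summary(m_polr)
```

Dichotomizing or binning a continuous response, as done here, is only a device for producing example outcomes; it is not a recommended analysis strategy.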

The standard reference for the S language is Becker et al. (1988). Highly recommended books on statistical modeling in R (S) are Chambers and Hastie (1992) and Venables and Ripley (1994). An introduction to statistics in R is Dalgaard (2002).

Browse through the R page at http://www.r-project.org/. If you have a particular data set that you would like advice on, bring it along to the course (on CD-ROM or floppy disk). The data should preferably be tab-delimited ASCII text.
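
As a quick check that a tab-delimited file loads cleanly in R, something along the following lines should work. The file contents and column names here are invented for the example, and the temporary file merely stands in for your own data file.

```r
# Write a tiny invented table to a temporary file
# (a stand-in for a real tab-delimited data file).
tmp <- tempfile(fileext = ".txt")
writeLines(c("word\tfrequency\tlength",
             "talo\t120\t4",
             "kieli\t85\t5"),
           tmp)

# read.delim() expects a header line and tab-separated fields by default
dat <- read.delim(tmp, stringsAsFactors = FALSE)

str(dat)      # column names and types as R understood them
summary(dat)  # quick overview of each column
```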
