Course in Soft Computing:
An introduction to R as a statistical programming environment for
the analysis of quantitative linguistic data (1 - 2 cr)
Lecturer
Harald Baayen
(Interfaculty research unit for language and speech, University of Nijmegen & Max Planck Institute for Psycholinguistics, Nijmegen)
Location
Department of General Linguistics
Siltavuorenpenger 20 A
Helsinki
Duration
13 - 17 December 2004
Course description
R is an open source implementation of the S language and
environment for data analysis originally developed at Bell
Laboratories. R is the focus of this course because it is an
elegant object-oriented system with excellent graphical
facilities, because it has a consistent syntax for
specifying statistical models, no matter which type of model is
being fitted, and because it is a programming language in which
new ideas, or applications for specific data sets, are easy to
implement.
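As a minimal sketch of what this uniform model syntax looks like, the lines below fit a linear and a logistic regression with the same formula notation. The data are simulated, and the data frame d with its columns rt, freq and regular is invented for this illustration; none of it comes from the course data sets.

```r
# Simulated data, for illustration only (not a course data set).
set.seed(1)
d <- data.frame(freq = rnorm(100))
d$rt <- 600 - 20 * d$freq + rnorm(100, sd = 30)     # hypothetical reaction times
d$regular <- rbinom(100, 1, plogis(0.5 * d$freq))   # hypothetical binary outcome

# The same notation, response ~ predictors, is used for both model types.
fit.lm  <- lm(rt ~ freq, data = d)                            # linear regression
fit.glm <- glm(regular ~ freq, family = binomial, data = d)   # logistic regression

summary(fit.lm)
summary(fit.glm)
```

Only the fitting function and, for glm, the family argument change; the formula itself is written in the same way.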
The morning sessions of this course will consist of plenary
presentations in which I will introduce R and a selection of
techniques that are especially useful for the analysis of
quantitative linguistic data. The afternoons will be practical
sessions in which participants in this course will obtain
hands-on experience with these techniques applied to full-scale,
actual linguistic data sets. In this course, the emphasis is,
on the one hand, on learning to use graphical tools to explore
the quantitative structure of linguistic data sets, and on the
other hand, on learning what statistical tools might be useful
for a given data set, and how to apply them in R.
This 5-day course is structured as follows:
Day 1: Introduction
- Morning session, part 1:
Introduction to the R programming language (whose syntax is similar to
that of MATLAB) and a series of basic statistical functions.
- Morning session, part 2:
Introduction to data visualization and trellis graphics.
- Afternoon sessions:
Hands-on sessions with simple problems to become familiar with R.
Day 2: Multivariate analysis
- Morning session, part 1:
Principal components analysis and correspondence analysis, biplots
- Morning session, part 2:
Cluster analysis, multidimensional scaling
- Afternoon sessions:
Hands-on practice with the data discussed in Baayen (1994),
Baayen et al. (1996): principal components analysis in stylometry
and authorship attribution;
Baayen and Moscoso del Prado Martín:
multidimensional scaling for regular and irregular
verbs in lexical co-occurrence space;
Moscoso del Prado Martín and Baayen (in press), Baayen et al.
(2004b): hierarchical cluster analyses on lexical variables.
Day 3: Classification methods
- Morning session, part 1:
Classification and regression trees, the tree and
rpart libraries
- Morning session, part 2:
Discriminant analysis, support vector machines
- Afternoon sessions:
Hands-on practice with the data discussed in
Ernestus and Baayen (2003):
classification trees for predicting final devoicing
in Dutch; Baayen et al. (1996): discriminant analysis in
authorship attribution;
Cueni et al. (2004): classification trees and support vector
machines for predicting the dative alternation.
Day 4: Regression
- Morning session, part 1:
Linear regression (including basic analysis of variance), relaxing
the linearity assumption (quadratic terms, restricted cubic splines)
- Morning session, part 2:
Regression diagnostics (strategies, graphical possibilities),
detecting and dealing with collinearity.
- Afternoon sessions:
Hands-on practice with the data discussed in Baayen et al. (2004b):
collinearity in lexical predictors for reaction
time data; also data from current work with Mirjam Ernestus and
Mark Pluymaekers on frequency effects in the acoustic signal of
morphologically complex words.
Day 5: Advanced regression
- Morning session, part 1:
Mixed effect models (the nlme library of
Pinheiro and Bates (2000))
- Morning session, part 2:
Logistic regression and proportional odds models (the Design
and Hmisc libraries of Harrell (2001)).
- Afternoon sessions:
Hands-on practice with the data discussed in
Baayen and Moscoso del Prado Martín (2004),
Tabak et al. (2004): logistic regression for
estimating the likelihood of a verb being regular;
Cueni et al. (2004): logistic regression for predicting
the dative alternation; Baayen et al. (2004a):
mixed effect modeling for reaction time data;
Keune and Baayen (2004): mixed effect modeling
for register variation.
The standard reference to the S language is Becker et al. (1988);
highly recommended books on statistical modeling in R (S) are
Chambers and Hastie (1992) and Venables and Ripley (1994).
An introduction to statistics in R is Dalgaard (2002).
Preparatory work
Browse through the R page at
http://www.r-project.org/.
If you have a particular data set that you would like to have advice on,
bring it along to the course (on CD-ROM or floppy disk). The data should
preferably be tab-delimited ASCII text.
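As a rough sketch of what this format looks like in practice, the following R lines write a small tab-delimited file and read it back with read.table. The file contents and column names are invented for illustration.

```r
# Write a tiny tab-delimited file to a temporary location (invented example data).
path <- tempfile(fileext = ".txt")
writeLines(c("word\tfrequency\trt",
             "huis\t1200\t540",
             "boom\t800\t575"), path)

# header = TRUE: the first line holds the column names;
# sep = "\t": columns are separated by tabs.
dat <- read.table(path, header = TRUE, sep = "\t")
str(dat)
```

A file with one header line of column names and one row per observation can be read in this way without further conversion.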