This is the timetable of the general research seminar in language technology. Presentations are given primarily by doctoral students, but also by other researchers in language technology, including external guests.
Abstract: In our study we examine the prosodic and syntactic structures of spoken language. Using wavelet-based analysis (the Continuous Wavelet Transform, CWT), the prosodic hierarchy of speech can be represented visually as a tree diagram resembling a syntactic tree. This enables a novel method for comparing prosodic and syntactic hierarchical structures in spoken language.
A CWT-based tool represents prosodic signals of speech (f0, energy, etc.) in a multidimensional time-frequency scale-space, akin to a spectrogram, so that structures not visible on the surface (in the time waveform, f0 contour, or energy envelope) are rendered visible. The method can be further enhanced with lines of maximum amplitude, resulting in a visual tree representation of the prosodic hierarchies of speech.
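The speakers' actual tool is not described in code, but the idea of a CWT scale-space with ridge lines can be illustrated with a minimal numpy sketch. Everything here is my own assumption for illustration: a synthetic f0 contour stands in for real speech, a Ricker (Mexican hat) wavelet is used as the analysis wavelet, and "lines of maximum amplitude" are approximated as local maxima along the time axis at each scale.

```python
import numpy as np

def ricker(n, scale):
    """Ricker (Mexican hat) wavelet sampled at n points for a given scale.
    This is one common CWT mother wavelet; the tool in the talk may use another."""
    t = np.arange(n) - (n - 1) / 2.0
    x = t / scale
    return (1 - x**2) * np.exp(-x**2 / 2)

def cwt_scalogram(signal, scales):
    """Continuous wavelet transform by direct convolution: one row per scale.
    Rows with large scales respond to phrase-level movement, small scales
    to syllable-level movement -- the 'scale-space' of the abstract."""
    out = np.empty((len(scales), len(signal)))
    for i, s in enumerate(scales):
        w = ricker(min(10 * int(s), len(signal)), s)
        out[i] = np.convolve(signal, w, mode="same")
    return out

# Toy "f0 contour": two phrase-like declination resets plus syllable wiggle
t = np.linspace(0.0, 2.0, 400)
f0 = 120 + 30 * np.exp(-3 * (t % 1.0)) + 5 * np.sin(2 * np.pi * 4 * t)

scales = np.arange(2, 60, 4)
scalogram = cwt_scalogram(f0 - f0.mean(), scales)

# Crude "lines of maximum amplitude": time-local maxima at each scale.
# Chaining these maxima across scales yields the tree-like ridge structure.
ridge_mask = (scalogram[:, 1:-1] > scalogram[:, :-2]) & \
             (scalogram[:, 1:-1] > scalogram[:, 2:])
```

In a real pipeline the ridges would be linked across scales into a hierarchy, which is what makes the representation comparable to a syntactic tree.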
In our recent research we have segmented a sample of spontaneous American English speech both prosodically and syntactically. The prosodic segmentation was conducted with an automatic, unsupervised CWT tool, while the syntactic boundaries were annotated by a group of students, since automatic English parsers turned out not to be reliable enough for spontaneous speech. The demarcation and internal structure of the segments were then analyzed in various respects, using, e.g., automatically produced prosodic and syntactic tree diagrams.
Our results indicate that in spontaneous English speech, prosodic and syntactic boundaries have a certain tendency to co-occur, but there are also notable differences in their typical positions. In addition, the internal structures of prosodic and syntactic units differ considerably from each other. The most apparent divergence is in the predominant direction of branching: syntactic trees of spoken English tend to be mainly right-branching, whereas in prosodic trees left- and right-branching structures alternate. It thus seems that prosodic and syntactic structures of spontaneous speech largely follow different patterns in their appearance, and perhaps also in their formation.
The presentation first reports results of experiments in which two popular statistical methods were used to model semantic relations between words (Lindh-Knuutila and Honkela 2015). We apply Independent Component Analysis (ICA) and a probabilistic topic model (Latent Dirichlet Allocation, LDA) to create semantic representations from a large text corpus, and compare the obtained results to two semantically labeled dictionaries.
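The abstract does not specify the implementation, but the two methods it names are both available in scikit-learn, so a toy version of the setup can be sketched. The six-document corpus below and the choice of two components are my own illustrative assumptions; the cited study uses a large corpus and many more dimensions.

```python
import numpy as np
from sklearn.decomposition import FastICA, LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Tiny stand-in corpus with two rough themes (animals vs. finance)
docs = [
    "the dog barks at the cat",
    "the cat chases the mouse",
    "stocks fell as markets closed",
    "investors sold stocks in the market",
    "the dog and the cat play",
    "market prices and stocks rise",
]

vec = CountVectorizer()
X = vec.fit_transform(docs)           # documents x words count matrix

# LDA: each document becomes a distribution over latent topics
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)     # rows sum to 1

# ICA: statistically independent components over the word x document matrix,
# giving each word a coordinate in a low-dimensional semantic space
ica = FastICA(n_components=2, random_state=0, max_iter=1000)
word_space = ica.fit_transform(X.toarray().astype(float).T)
```

Comparing such representations against semantically labeled dictionaries, as the study does, then amounts to checking whether words sharing a dictionary label cluster along the same components or topics.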
The second part of the presentation discusses some practical, methodological, and philosophical aspects of theory-driven and statistical modeling (see, e.g., Honkela 2007). Relativity has been considered a problematic notion, but convincing arguments have been presented for its importance (see, e.g., Rorty 1991). What is the status of human-crafted models used as a gold standard and assumed to be objective representations of linguistic phenomena? Do different, often mutually incompatible representations provide different points of view, rather than any one of them having an objective status? If the big picture is more complicated than traditionally viewed within linguistics and computational linguistics, what is a sensible way forward? How should linguistic theories and computational models be evaluated? What are the consequences for a broader range of scientific disciplines if semantic relativity includes contextuality and subjectivity in the interpretation of linguistic signs (Sintonen et al. 2014)?
Corpus linguistics hinges on the statistical analysis of corpora. To conduct such analysis (e.g., regression or hypothesis testing) well, it is imperative to know which forms of model fit the observations. To address this question, I will first present some observations on word and collocate occurrence statistics, studying occurrence distributions both within and across texts. Essentially, we review to what extent texts and corpora can be viewed as homogeneous entities. Secondly, I will discuss the implications of these observations for the statistical comparison of natural language corpora, e.g., for the analysis of language change, finding differences between subcorpora, and text segmentation.
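One standard instance of the kind of corpus comparison mentioned here is Dunning's log-likelihood ratio (G²) test for a word's frequency difference between two subcorpora. The talk does not say which tests it uses; this is a generic, stdlib-only sketch with invented two-sentence "corpora" for illustration.

```python
import math
from collections import Counter

def g2_keyness(word, corpus_a, corpus_b):
    """Dunning's log-likelihood ratio (G^2) for one word's frequency
    difference between two corpora, each given as a list of tokens.
    Large G^2 means the observed counts are unlikely under the null
    hypothesis that both corpora share a single occurrence rate."""
    a = Counter(corpus_a)[word]
    b = Counter(corpus_b)[word]
    na, nb = len(corpus_a), len(corpus_b)
    # Expected counts if both corpora drew the word at the pooled rate
    ea = na * (a + b) / (na + nb)
    eb = nb * (a + b) / (na + nb)
    g2 = 0.0
    for obs, exp in ((a, ea), (b, eb)):
        if obs > 0:  # zero counts contribute nothing to the sum
            g2 += 2 * obs * math.log(obs / exp)
    return g2

corpus_a = "the cat sat on the mat the cat".split()
corpus_b = "the dog ran in the park".split()
print(round(g2_keyness("cat", corpus_a, corpus_b), 3))  # → 2.238
```

Note that such tests assume words occur independently; the talk's point about texts not being homogeneous entities is precisely why that assumption, and hence the choice of model, matters.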
Bio: Jefrey Lijffijt is a postdoc in Data Science at Ghent University, Belgium. He obtained his D.Sc. (Tech.) degree with distinction in December 2013 from Aalto University, Finland. His thesis on computational methods for analysing databases of sequences received the Best Doctoral Thesis of 2013 award from the Aalto University School of Science. He obtained BSc and MSc degrees in Computer Science from Utrecht University in 2006 and 2008, respectively. He has worked as a research intern at Philips Research, Eindhoven, as a consultant in predictive analytics at Crystalloids, Amsterdam, and as a postdoctoral researcher at Aalto University, Finland, and the University of Bristol, UK. His research interests include (visual, interactive) mining of interesting or surprising patterns in transactional, sequential, and relational data, including graphs, as well as text mining, natural language processing, statistical significance testing, maximum entropy modelling, and subjective interestingness in data analysis.