Research Seminar in Language Technology - Spring 2016

Course Information

This is the timetable of the general research seminar in language technology. Presentations will be given primarily by doctoral students, but also by other researchers in language technology, including external guests.

Place and Time:

Registration and Announcements:

  • Register for the mailing list lt-research.

Tentative Schedule:

21 January: Niklas Laxström
Place: U35 sh 114
Title: Internationalisation and localisation of spoken dialogue systems
Abstract: In contemporary software development, localisation is a straightforward process, assuming internationalisation has been considered during development. The localisation of spoken dialogue systems is less mature, possibly because they differ from common software in that interaction with them is situated and uses multiple modalities. We claim that it is possible to apply software internationalisation practices to spoken dialogue systems and that doing so facilitates the rapid localisation of such systems to new languages. We internationalised and localised the WikiTalk spoken dialogue system. During the process we identified needs relevant to spoken dialogue systems that will benefit from further research and engineering efforts.
4 February: Martti Vainio
Place: U35 sh 114
Title: Phonetics and speech synthesis research at UH
11 February: Whole-day workshop!
Title: Cross-Lingual and Multilingual NLP
18 February: Lightning talk seminar
Place: U40 sali 27
Title: An overview of LT research in Helsinki
3 March: Jussi Piitulainen and Hanna Nurmi
Place: U35 sh 114
Title: Mapping the FinnTreeBank model to Universal Dependencies
10 March: FIN-CLARIN
Place: U40 sali 5
Title: The FIN-CLARIN Road Show
17 March: Anna Dannenberg
Place: U35 sh 114
Title: Prosodic and syntactic structures in spontaneous English speech

Abstract: In our study we examine prosodic and syntactic structures of spoken language. By wavelet-based analysis (Continuous Wavelet Transform, CWT), the prosodic hierarchy of speech can be visually represented as a tree diagram resembling syntactic trees. This enables a novel method to compare prosodic and syntactic hierarchical structures in spoken language.

A CWT-based tool represents prosodic signals of speech (f0, energy, etc.) in a multidimensional time-frequency scale-space akin to spectrograms, so that structures that are not visible on the surface (time waveform, f0 contour, energy envelope) are rendered visible. The method can be further enhanced with lines of maximum amplitude, resulting in a visual tree representation of the prosodic hierarchies of speech.

In our recent research we have segmented a sample of spontaneous American English speech both prosodically and syntactically. The prosodic segmentation has been conducted using an automatic and unsupervised CWT tool, and the syntactic boundaries have been annotated by a group of students, the automatic English parsers having turned out not to be reliable enough for spontaneous speech. The demarcation and internal structure of different segments have then been analyzed in various respects, using e.g. automatically produced prosodic and syntactic tree diagrams.

Our results indicate that in spontaneous English speech, prosodic and syntactic boundaries have a certain tendency to co-occur, but there are also notable differences in their typical positions. In addition, the internal structures of prosodic and syntactic units differ considerably from each other. The most apparent divergence is in the predominant direction of branching: syntactic trees of spoken English tend to be mainly right-branching, whereas in prosodic trees left- and right-branching structures alternate. It thus seems that prosodic and syntactic structures of spontaneous speech largely follow different patterns in their appearance and perhaps also in their formation.
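
To make the scale-space idea in this abstract concrete, here is a minimal sketch, not the tool used in the study: a continuous wavelet transform of a one-dimensional contour computed with NumPy only. The Mexican-hat (Ricker) wavelet, the function names (ricker, cwt, dominant_scale) and the synthetic test contour are all assumptions made for illustration; the abstract does not specify any of them, and the lines of maximum amplitude are not covered here.

```python
import numpy as np

def ricker(points, a):
    """Mexican-hat (Ricker) wavelet with width `a`, sampled at `points` positions."""
    t = np.arange(points) - (points - 1) / 2.0
    norm = 2.0 / (np.sqrt(3.0 * a) * np.pi ** 0.25)
    return norm * (1.0 - (t / a) ** 2) * np.exp(-0.5 * (t / a) ** 2)

def cwt(signal, widths):
    """Return a (len(widths), len(signal)) array of wavelet coefficients.

    Each row is the signal convolved with the wavelet at one scale, so
    slow and fast variations end up in different rows.
    """
    signal = np.asarray(signal, dtype=float)
    out = np.empty((len(widths), len(signal)))
    for i, a in enumerate(widths):
        # Truncate the wavelet at roughly +/- 5 widths (assumed cut-off).
        n = min(10 * int(a), len(signal))
        out[i] = np.convolve(signal, ricker(n, a), mode="same")
    return out

def dominant_scale(signal, widths):
    """Return the width whose coefficients are largest on average.

    The outer quarters of the signal are ignored to limit edge effects.
    """
    coeffs = cwt(signal, widths)
    margin = len(signal) // 4
    energy = np.abs(coeffs[:, margin:-margin]).mean(axis=1)
    return widths[int(np.argmax(energy))]

# Hypothetical toy contour: a single slow oscillation standing in for f0.
t = np.arange(1000)
contour = np.sin(2 * np.pi * t / 40.0)
widths = list(range(1, 31))
print("scalogram shape:", cwt(contour, widths).shape)
print("dominant scale:", dominant_scale(contour, widths))
```

Plotting the coefficient array as an image, with time on one axis and scale on the other, gives a spectrogram-like picture of the kind the abstract describes.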

31 March: Heidi Jauhiainen
Place: U35 sh 114
Title: Finno-Ugric Languages and the Internet project
7 April: Canceled! Lauri Carlson, Kun Ji, Shanshan Wang
Place: U35 sh 114
14 April: Jörg Tiedemann
Place: U35 sh 114
Title: Discourse-Level Machine Translation
Abstract: Machine translation usually ignores discourse-level phenomena. Sentences are translated in isolation, ignoring all surrounding context and the interconnectedness of sentences. Common statistical MT models apply even more drastic restrictions and focus on very narrow windows inside a sentence when translating an input string from a source to a target language. From a linguistic point of view, this is not very satisfactory, and from the end-user perspective it leads to unacceptable results with respect to textual coherence. We currently run a project that tries to integrate long-distance dependencies, even across sentence boundaries, into the general framework of phrase-based statistical MT. A document-level decoder with a new translation search strategy has been implemented as a general framework for our experiments. In the talk, I will present the tools and a few models that we have implemented, focusing on lexical cohesion and pronominal anaphora. I am also looking forward to discussing your ideas and suggestions.
21 April: Lauri Carlson, Kun Ji, Shanshan Wang
Place: U35 sh 114
Title: TermFactory
12 May: Timo Honkela, University of Helsinki
Place: U35 sh 114
Title: Epistemological status of linguistic theories and models

Abstract: The talk first presents results of experiments in which two popular statistical methods were used to model semantic relations of words (Lindh-Knuutila and Honkela 2015). We apply Independent Component Analysis (ICA) and a probabilistic topic model (Latent Dirichlet Allocation, LDA) to create semantic representations from a large text corpus. We further compare the obtained results to two semantically labeled dictionaries.

The second part of the presentation discusses some practical, methodological and philosophical aspects related to theory-driven and statistical modeling (see, e.g., Honkela 2007). Relativity has been considered to be a problematic notion, but convincing arguments have been presented concerning its importance (see, e.g., Rorty 1991). What is the status of human-crafted models used as a gold standard, assumed to be objective representations of linguistic phenomena? Do different, often mutually incompatible representations provide different points of view rather than any of them having an objective status? If the big picture is more complicated than traditionally viewed within linguistics and computational linguistics, what is a sensible way forward? How should linguistic theories and computational models be evaluated? What are the consequences for a broader range of scientific disciplines if semantic relativity includes contextuality and subjectivity of interpretation of linguistic signs (Sintonen et al. 2014)?


  • Timo Honkela. Philosophical aspects of neural, probabilistic and fuzzy modeling of language use and translation. Proceedings of IJCNN 2007, pp. 2881-2886.
  • Tiina Lindh-Knuutila and Timo Honkela. Exploratory analysis of semantic categories: comparing data-driven and human similarity judgments. Computational Cognitive Science, 2015, 1:2.
  • Richard Rorty. Objectivity, relativism, and truth. Cambridge University Press, 1991.
  • Henry Sintonen, Juha Raitio and Timo Honkela. Quantifying the effect of meaning variation in survey analysis. Proceedings of ICANN 2014, pp. 757-764.
19 May: BAULT Seminar with Jefrey Lijffijt (Ghent)
Place: U35 sh 114
Title: Word occurrence distributions, and implications for statistical analysis and comparison of corpora

Abstract: Corpus linguistics hinges on statistical analysis of corpora. To conduct such analysis (e.g., regression or hypothesis testing) well, it is imperative to know which forms of models fit the observations well. To address this question, I will first present some observations regarding word and collocate occurrence statistics. I will study occurrence distributions both within texts and across texts. Essentially, we review to what extent texts and corpora can be viewed as homogeneous entities. Secondly, I will discuss the implications of these observations for statistical comparison of natural language corpora, e.g., for analysis of language change, finding differences between subcorpora, and text segmentation.

Bio: Jefrey Lijffijt is a Postdoc in Data Science at Ghent University, Belgium. He obtained his D.Sc. (Tech.) degree with distinction in December 2013 from Aalto University, Finland. His thesis on computational methods to analyse databases of sequences received the Best Doctoral Thesis of 2013 award from the Aalto University School of Science. He obtained BSc and MSc degrees in Computer Science from Utrecht University in 2006 and 2008, respectively. He has worked as a research intern at Philips Research, Eindhoven, as a consultant in predictive analytics at Crystalloids, Amsterdam, and as a postdoctoral researcher at Aalto University, Finland, and the University of Bristol, UK. His research interests include (visual interactive) mining of interesting/surprising patterns in transactional, sequential, and relational data, including graphs, as well as text mining, natural language processing, statistical significance testing, maximum entropy modelling, and subjective interestingness in data analysis.

9 June: canceled (go to FIN-CLARIN event instead!)
16 June: Jyrki Niemi
Place: Metsätalo, common room on the 6th floor (B610?)
Title: Modelling and reasoning with the semantics of temporal expressions using finite-state methods
14 July: Claire de Maricourt
Place: U40, common room on the 6th floor (B610?)
Title: Different methods for accelerating multilingual word alignment
Abstract: The main problem I consider is how to align words in parallel texts covering many languages using a Bayesian model. To do so, a "hidden" source text is introduced that is assumed to generate all the other texts, and the task is to find alignments between the hidden source text and the different translations. The problem I have been working on is how to do this efficiently, since the collapsed Gibbs sampler used previously does not parallelize well. First I used a partially collapsed sampler, which allows parallelization but requires more computation. Unfortunately, although it worked, convergence was too slow. I am currently working on another approach, using Metropolis-Hastings sampling to improve parallelization of the collapsed sampler.